Abstract
Video-based multi-modal person identification has attracted growing research interest as a way to address the limitations of single-modal identification in unconstrained scenes. Most existing methods model the video-level and multi-modal-level information of the target video separately, and therefore suffer both from the separation of these levels and from the limited information contained in a single video. In this paper, we introduce extra neighbor-level information for the first time to enrich the representation of the target video. We then propose a Multi-Level (neighbor-level, multi-modal-level, and video-level) and Multi-Modal GCN model (MMM-GCN) that captures the correlations among different levels and achieves adaptive fusion within a unified model. Experiments on the iQIYI-VID-2019 dataset show that MMM-GCN significantly outperforms current state-of-the-art methods, demonstrating its superiority and effectiveness. In addition, we show that feature fusion is heavily polluted by noisy nodes, which leads to suboptimal results; further improvements can build on this observation to approach the performance upper bound of our paradigm.
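To make the fusion idea concrete, the sketch below builds one toy graph over a target video's modality nodes (multi-modal level) and a few retrieved neighbor videos (neighbor level), applies a single standard graph-convolution propagation step, and pools the result into one video-level embedding. This is a minimal illustration under assumed choices (node layout, feature dimension, a fully connected adjacency, mean-pooling readout); it is not the paper's MMM-GCN architecture.

```python
# Minimal sketch: graph-convolution fusion over modality nodes and neighbor-video
# nodes. All structural choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One Kipf-Welling style propagation step: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        adj_hat = adj + torch.eye(adj.size(0))        # add self-loops
        deg = adj_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        norm_adj = d_inv_sqrt @ adj_hat @ d_inv_sqrt  # symmetric normalization
        return torch.relu(norm_adj @ self.linear(feats))


def fuse_levels(modal_feats: torch.Tensor,
                neighbor_feats: torch.Tensor,
                layer: SimpleGCNLayer) -> torch.Tensor:
    """Stack modality nodes and neighbor-video nodes into one graph, connect
    every node to every other node (an assumed topology), and read out a single
    video-level embedding by mean pooling."""
    nodes = torch.cat([modal_feats, neighbor_feats], dim=0)  # (M + K, D)
    n = nodes.size(0)
    adj = torch.ones(n, n) - torch.eye(n)                    # fully connected graph
    fused = layer(nodes, adj)
    return fused.mean(dim=0)                                 # video-level representation


if __name__ == "__main__":
    torch.manual_seed(0)
    modal = torch.randn(3, 128)      # e.g. face / audio / body features of the target video (assumed)
    neighbors = torch.randn(5, 128)  # features of 5 retrieved neighbor videos (assumed)
    gcn = SimpleGCNLayer(128, 128)
    video_embedding = fuse_levels(modal, neighbors, gcn)
    print(video_embedding.shape)     # torch.Size([128])
```

In this toy setting the readout is a plain mean over all nodes; the point of the sketch is only to show how neighbor-level and multi-modal-level features can be placed in one graph and aggregated jointly rather than fused in separate stages.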
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liao, Z., Di, D., Hao, J., Zhang, J., Zhu, S., Yin, J. (2023). MMM-GCN: Multi-Level Multi-Modal Graph Convolution Network for Video-Based Person Identification. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_1
DOI: https://doi.org/10.1007/978-3-031-27077-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27076-5
Online ISBN: 978-3-031-27077-2
eBook Packages: Computer Science, Computer Science (R0)