
A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14429)


Abstract

Video face recognition (VFR) has gained significant attention as a promising field combining computer vision and artificial intelligence, revolutionizing identity authentication and verification. Unlike traditional image-based methods, VFR leverages the temporal dimension of video footage to extract comprehensive and accurate facial information. However, VFR heavily relies on robust computing power and advanced noise processing capabilities to ensure optimal recognition performance. This paper introduces a novel length-adaptive VFR framework based on a recurrent-mechanism-driven Vision Transformer, termed TempoViT. TempoViT efficiently captures spatial and temporal information from face videos, enabling accurate and reliable face recognition while mitigating the high GPU memory requirements associated with video processing. By leveraging the reuse of hidden states from previous frames, the framework establishes recurring links between frames, allowing the modeling of long-term dependencies. Experimental results validate the effectiveness of TempoViT, demonstrating its state-of-the-art performance in video face recognition tasks on benchmark datasets including iQIYI-ViD, YTF, IJB-C, and Honda/UCSD.
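The recurrence the abstract describes, reusing hidden states from previous frames to establish recurring links between frames, is in the spirit of Transformer-XL-style state caching applied to a Vision Transformer. The sketch below illustrates that general idea only; the class and attribute names (`RecurrentAttentionBlock`, `memory`) and all shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class RecurrentAttentionBlock:
    """Single-head self-attention that prepends the cached hidden
    states of the previous frame to the current frame's key/value
    sequence, so each frame attends to its predecessor's states.
    Illustrative sketch, not the paper's architecture."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(0, s, (dim, dim))
        self.wk = rng.normal(0, s, (dim, dim))
        self.wv = rng.normal(0, s, (dim, dim))
        self.dim = dim
        self.memory = None  # hidden states cached from the previous frame

    def __call__(self, tokens):
        # tokens: (n_tokens, dim) patch embeddings of one frame
        context = tokens if self.memory is None else np.vstack([self.memory, tokens])
        q = tokens @ self.wq
        k = context @ self.wk
        v = context @ self.wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))  # (n_tokens, n_context)
        out = attn @ v
        # cache this frame's states for the next call (in training these
        # would be detached so no gradient flows through the recurrence)
        self.memory = tokens.copy()
        return out

# process a short face video frame by frame with fixed per-frame cost
block = RecurrentAttentionBlock(dim=16)
video = np.random.default_rng(1).normal(size=(4, 8, 16))  # 4 frames, 8 tokens each
outputs = [block(frame) for frame in video]
```

Because only one frame's states are cached at a time, per-frame memory stays constant regardless of video length, which is the kind of length-adaptive behavior the abstract claims while still propagating information across frames.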



Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62376003, 62306003, 62372004, 62302005).

Author information

Correspondence to Jiewen Yang or Xingbo Dong.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, H. et al. (2024). A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14429. Springer, Singapore. https://doi.org/10.1007/978-981-99-8469-5_3


  • DOI: https://doi.org/10.1007/978-981-99-8469-5_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8468-8

  • Online ISBN: 978-981-99-8469-5

  • eBook Packages: Computer Science, Computer Science (R0)
