
Learning Discriminative Aggregation Network for Video-Based Face Recognition and Person Re-identification

Published in: International Journal of Computer Vision

Abstract

In this paper, we propose a discriminative aggregation network for video-based face recognition and person re-identification, which aims to integrate information from video frames into feature representations effectively and efficiently. Unlike existing video aggregation methods, our method aggregates raw video frames directly rather than features obtained through complex processing. By combining metric learning and adversarial learning, we learn an aggregation network that generates images more discriminative than the raw input frames. Our framework reduces the number of image frames per video that must be processed and significantly speeds up the recognition procedure. Furthermore, low-quality frames containing misleading information are filtered and denoised during aggregation, which makes our method more robust and discriminative. Experimental results on several widely used datasets show that our method generates discriminative images from video clips and improves overall recognition performance in both speed and accuracy for video-based face recognition and person re-identification.
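The abstract describes a learned aggregation network, trained with metric and adversarial losses, that maps many raw video frames to a few discriminative images; the implementation itself is not reproduced on this page. As a rough, hypothetical illustration of the input-output behaviour only (not the authors' network), the NumPy sketch below collapses a frame stack into one image by softmax-weighted averaging, so frames with low hypothetical quality scores contribute little:

```python
import numpy as np

def aggregate_frames(frames, quality_scores):
    """Collapse a stack of video frames (N, H, W, C) into one image.

    Illustrative stand-in for a learned aggregation network: weights come
    from a softmax over per-frame quality scores, so low-quality frames
    are effectively filtered out of the aggregate.
    """
    # Numerically stable softmax over the per-frame scores.
    w = np.exp(quality_scores - quality_scores.max())
    w = w / w.sum()
    # Contract the frame axis: weighted average, result shape (H, W, C).
    return np.tensordot(w, frames, axes=1)

# Toy example: 4 random frames; frame 2 is scored as low quality
# (the scores here are hypothetical, not produced by any real model).
rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8, 3))
scores = np.array([2.0, 1.5, -3.0, 1.8])
out = aggregate_frames(frames, scores)  # one (8, 8, 3) aggregated image
```

In the paper the weighting is implicit in a convolutional generator trained adversarially end-to-end; the softmax average above only conveys the many-frames-in, one-image-out shape of the problem.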


References

  • Baltieri, D., Vezzani, R., & Cucchiara, R. (2011). 3dpes: 3d people dataset for surveillance and forensics. In Proceedings of the 2011 joint ACM workshop on human gesture and behavior understanding, ACM, pp. 59–64.

  • Beveridge, J. R., Phillips, P. J., Bolme, D. S., Draper, B. A., Givens, G. H., Lui, Y. M., et al. (2013). The challenge of face recognition from digital point-and-shoot cameras. In 2013 IEEE sixth international conference on biometrics: Theory, applications and systems (BTAS), pp. 1–8.

  • Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face and gesture recognition (FG 2018), IEEE, pp. 67–74.

  • Cevikalp, H., & Triggs, B. (2010). Face recognition based on image sets. In 2010 IEEE conference on CVPR, pp. 2567–2573.

  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016b). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pp. 2172–2180.

  • Chen, J. C., Patel, V. M., & Chellappa, R. (2016a). Unconstrained face verification using deep CNN features. In 2016 IEEE winter conference on applications of computer vision (WACV), pp. 1–9.

  • Chen, J. C., Ranjan, R., Kumar, A., Chen, C. H., Patel, V. M., & Chellappa, R. (2015). An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proceedings of the IEEE international conference on computer vision workshops, pp. 118–126.

  • Chen, Y. C., Patel, V. M., Phillips, P. J., & Chellappa, R. (2012). Dictionary-based face recognition from video. Berlin: Springer, pp. 766–779.

  • Ding, C., & Tao, D. (2017). Trunk-branch ensemble convolutional neural networks for video-based face recognition. PAMI.

  • Dong, C., Loy, C. C., He, K., & Tang, X. (2014). Learning a deep convolutional network for image super-resolution. In ECCV, Springer, pp. 184–199.

  • Dong, C., Loy, C. C., He, K., & Tang, X. (2016). Image super-resolution using deep convolutional networks. T-PAMI, 38(2), 295–307.


  • Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010). Cascade object detection with deformable part models. In 2010 IEEE conference on CVPR, IEEE, pp. 2241–2248.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In NIPS, pp. 2672–2680.

  • Gray, D., Brennan, S., & Tao, H. (2007). Evaluating appearance models for recognition, reacquisition, and tracking. In PETS, Citeseer, Vol. 3, pp. 1–7.

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2009). Is that you? Metric learning approaches for face identification. In ICCV, pp. 498–505.

  • Hassner, T., Masi, I., Kim, J., Choi, J., Harel, S., Natarajan, P., et al. (2016). Pooling faces: Template based face recognition with pooled face images. In CVPRW, pp. 59–67.

  • Hayat, M., Bennamoun, M., & An, S. (2015). Deep reconstruction models for image set classification. PAMI, 37(4), 713–727.


  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pp. 1026–1034.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pp. 770–778.

  • Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

  • Hirzer, M., Beleznai, C., Roth, P. M., & Bischof, H. (2011). Person re-identification by descriptive and discriminative classification, Springer, Berlin, pp. 91–102.

  • Hu, J., Lu, J., & Tan, Y. P. (2014a). Discriminative deep metric learning for face verification in the wild. In CVPR, pp. 1875–1882.

  • Hu, J., Lu, J., Yuan, J., & Tan, Y. P. (2014b). Large margin multi-metric learning for face and kinship verification in the wild. In ACCV, pp. 252–267.

  • Hu, Y., Mian, A. S., & Owens, R. (2011). Sparse approximated nearest points for image set classification. In Computer vision and pattern recognition, pp. 121–128.

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. IEEE Conference on Computer Vision and Pattern Recognition.

  • Huang, Z., & Van Gool, L. (2016). A riemannian network for SPD matrix learning. arXiv preprint arXiv:1608.04233.

  • Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report 07–49, University of Massachusetts, Amherst.

  • Huang, Z., Wang, R., Shan, S., & Chen, X. (2014). Learning euclidean-to-riemannian metric for point-to-set classification. In CVPR, pp. 1677–1684.

  • Huang, Z., Wang, R., Shan, S., Li, X., & Chen, X. (2015). Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, pp. 720–729.

  • Huang, Z., Wu, J., & Van Gool, L. (2016). Building deep networks on grassmann manifolds. arXiv preprint arXiv:1611.05742.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

  • Ionescu, C., Vantzos, O., & Sminchisescu, C. (2015). Matrix backpropagation for deep networks with structured layers. In ICCV, pp. 2965–2973.

  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2016). Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

  • Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In NIPS, pp. 2017–2025.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM-MM, pp. 675–678.

  • Kawanishi, Y., Wu, Y., Mukunoki, M., & Minoh, M. (2014). Shinpuhkan2014: A multi-camera pedestrian dataset for tracking people across multiple cameras. In 20th Korea-Japan joint workshop on frontiers of computer vision (Vol. 5, p. 6).

  • Kim, M., Kumar, S., Pavlovic, V., & Rowley, H. (2008). Face tracking and recognition with visual constraints in real-world videos. In CVPR, pp. 1–8.

  • Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  • Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., et al. (2015). Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In CVPR, pp. 1931–1939.

  • Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

  • Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. (2016). Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802.

  • Li, W., & Wang, X. (2013). Locally aligned feature transforms across views. In CVPR, pp. 3594–3601.

  • Li, H., Hua, G., Shen, X., Lin, Z., & Brandt, J. (2014a). Eigen-pep for video face recognition. In ACCV, pp. 17–33.

  • Li, W., Zhao, R., Xiao, T., & Wang, X. (2014b). Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pp. 152–159.

  • Lin, J., Ren, L., Lu, J., Feng, J., & Zhou, J. (2017). Consistent-aware deep learning for person re-identification in a camera network. In CVPR, pp. 5771–5780.

  • Liu, Y., Yan, J., & Ouyang, W. (2017). Quality aware network for set to set recognition. In CVPR, Vol. 2, p. 8.

  • Lu, J., Wang, G., & Moulin, P. (2013). Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In ICCV, pp. 329–336.

  • Lu, J., Wang, G., Deng, W., Moulin, P., & Zhou, J. (2015). Multi-manifold deep metric learning for image set classification. In CVPR, pp. 1137–1145.

  • Lu, J., Wang, G., & Moulin, P. (2016). Localized multifeature metric learning for image-set-based face recognition. TCSVT, 26(3), 529–540.


  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR, 9(Nov), 2579–2605.


  • Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In BMVC, Vol. 1, p. 6.

  • Paszke, A., Gross, S., Chintala, S., & Chanan, G. (2017). Pytorch: Tensors and dynamic neural networks in python with strong GPU acceleration.

  • Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

  • Rao, Y., Lin, J., Lu, J., & Zhou, J. (2017). Learning discriminative aggregation network for video-based face recognition. In ICCV, pp. 3781–3790.

  • Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Generative adversarial text to image synthesis. In ICML, Vol. 3.

  • Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In CVPR, pp. 815–823.

  • Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., et al. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883.

  • Sohn, K., Liu, S., Zhong, G., Yu, X., Yang, M. H., & Chandraker, M. (2017). Unsupervised domain adaptation for face recognition in unlabeled videos. In CVPR, pp. 3210–3218.

  • Sun, Y., Wang, X., & Tang, X. (2015). Deeply learned face representations are sparse, selective, and robust. In CVPR, pp. 2892–2900.

  • Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In CVPR, pp. 1701–1708.

  • Tesfaye, Y. T., Zemene, E., Prati, A., Pelillo, M., & Shah, M. (2017). Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint arXiv:1706.06196.

  • Tran, L., Yin, X., & Liu, X. (2017). Disentangled representation learning gan for pose-invariant face recognition. In CVPR, Vol. 3, p. 7.

  • Wang, R., & Chen, X. (2009). Manifold discriminant analysis. In CVPR, pp. 429–436.

  • Wang, R., Guo, H., Davis, L. S., & Dai, Q. (2012). Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, pp. 2496–2503.

  • Wang, J., Lu, C., Wang, M., Li, P., Yan, S., & Hu, X. (2014). Robust face recognition via adaptive sparse representation. IEEE Transactions on Cybernetics, 44(12), 2368–2378.


  • Wang, T., Gong, S., Zhu, X., & Wang, S. (2016). Person re-identification by discriminative selection in video ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12), 2501–2514.


  • Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515.

  • Whitelam, C., Taborsky, E., Blanton, A., Maze, B., Adams, J. C., Miller, T., et al. (2017). Iarpa janus benchmark-b face dataset. In Workshops on CVPR, pp. 592–600.

  • Wolf, L., Hassner, T., & Maoz, I. (2011). Face recognition in unconstrained videos with matched background similarity. In CVPR, pp. 529–534.

  • Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.


  • Xiao, T., Li, H., Ouyang, W., & Wang, X. (2016). Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, pp. 1249–1258.

  • Yang, J., Ren, P., Chen, D., Wen, F., Li, H., & Hua, G. (2016a). Neural aggregation network for video face recognition. arXiv preprint arXiv:1603.05474.

  • Yang, M., Wang, X., Liu, W., & Shen, L. (2016b). Joint regularized nearest points for image set based face recognition. IVC.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., & Metaxas, D. (2016a). Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242.

  • Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016b). Joint face detection and alignment using multitask cascaded convolutional networks. SPL, 23(10), 1499–1503.


  • Zhang, W., Hu, S., & Liu, K. (2017). Learning compact appearance representation for video-based person re-identification. arXiv preprint arXiv:1702.06294.

  • Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., et al. (2016). Mars: A video benchmark for large-scale person re-identification. In ECCV, Springer, pp. 868–884.

  • Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In ICCV, pp. 1116–1124.

  • Zheng, W. S., Gong, S., & Xiang, T. (2009). Associating groups of people. In BMVC, Vol. 2.

  • Zhong, Z., Zheng, L., Cao, D., & Li, S. (2017). Re-ranking person re-identification with k-reciprocal encoding. arXiv preprint arXiv:1701.08398.

  • Zhou, Z., Huang, Y., Wang, W., Wang, L., & Tan, T. (2017). See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, IEEE, pp. 6776–6785.


Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant 61672306, Grant U1713214, Grant 61572271, and in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jiwen Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals.

Additional information

Communicated by Rama Chellappa, Xiaoming Liu, Tae-Kyun Kim, Fernando De la Torre, Chen Change Loy.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Part of this work was presented in Rao et al. (2017).


Cite this article

Rao, Y., Lu, J. & Zhou, J. Learning Discriminative Aggregation Network for Video-Based Face Recognition and Person Re-identification. Int J Comput Vis 127, 701–718 (2019). https://doi.org/10.1007/s11263-018-1135-x
