
SPL-Net: Spatial-Semantic Patch Learning Network for Facial Attribute Recognition with Limited Labeled Data

Published in: International Journal of Computer Vision

Abstract

Existing deep learning-based facial attribute recognition (FAR) methods rely heavily on large-scale labeled training data. Unfortunately, in many real-world applications only limited labeled data are available, causing the performance of these methods to deteriorate. To address this issue, we propose a novel spatial-semantic patch learning network (SPL-Net), consisting of a multi-branch shared subnetwork (MSS), three auxiliary task subnetworks (ATS), and an FAR subnetwork, for attribute classification with limited labeled data. Considering the diversity of facial attributes, MSS includes a task-shared branch and four region branches, each of which contains cascaded dual cross attention modules to extract region-specific features. SPL-Net involves a two-stage learning procedure. In the first stage, MSS and ATS are jointly trained to perform three auxiliary tasks (i.e., a patch rotation task (PRT), a patch segmentation task (PST), and a patch classification task (PCT)), which exploit the spatial-semantic relationship in large-scale unlabeled facial data from various perspectives. Specifically, PRT encodes the spatial information of facial images based on self-supervised learning, while PST and PCT capture the pixel-level and image-level semantic information, respectively, by leveraging a facial parsing model. A well-pre-trained MSS is thus obtained. In the second stage, based on the pre-trained MSS, an FAR model is easily fine-tuned to predict facial attributes with only a small amount of labeled data. Experimental results on challenging facial attribute datasets (including CelebA, LFWA, and MAAD) show the superiority of SPL-Net over several state-of-the-art methods when labeled data are limited.
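To make the first pre-training stage more concrete, below is a minimal sketch of a rotation-prediction pretext task in the spirit of PRT: each image is split into a grid of patches, every patch is rotated by a random multiple of 90 degrees, and a shared encoder is trained to predict the rotation. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation; the PatchRotationTask class, the 2x2 patch grid, and the linear head are all assumptions.

```python
# Minimal sketch of a patch-rotation pretext task (rotation-based
# self-supervision on patches). Illustrative only, NOT the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchRotationTask(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, grid: int = 2):
        super().__init__()
        self.encoder = encoder               # shared backbone (plays the role of MSS)
        self.grid = grid                     # split each image into grid x grid patches
        self.head = nn.Linear(feat_dim, 4)   # 4 rotation classes: 0/90/180/270 degrees

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W), with H and W divisible by self.grid
        b, c, h, w = images.shape
        ph, pw = h // self.grid, w // self.grid
        patches = (images
                   .unfold(2, ph, ph).unfold(3, pw, pw)  # (B, C, g, g, ph, pw)
                   .permute(0, 2, 3, 1, 4, 5)
                   .reshape(-1, c, ph, pw))              # (B*g*g, C, ph, pw)
        # Rotate every patch by a random multiple of 90 degrees; the multiple
        # itself serves as the self-supervised label.
        labels = torch.randint(0, 4, (patches.size(0),), device=images.device)
        rotated = torch.stack([torch.rot90(p, int(k), dims=(1, 2))
                               for p, k in zip(patches, labels)])
        # encoder is assumed to map (N, C, ph, pw) -> (N, feat_dim)
        logits = self.head(self.encoder(rotated))
        return F.cross_entropy(logits, labels)

# Toy usage on random "unlabeled" images:
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
task = PatchRotationTask(encoder, feat_dim=16)
loss = task(torch.randn(8, 3, 128, 128))
loss.backward()
```

In the full method, a loss of this kind would be combined with the PST and PCT objectives to pre-train MSS on unlabeled faces before the second-stage FAR fine-tuning.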


Data Availability Statement

The datasets that support the findings of this study are available at:

  • CelebA: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
  • LFWA: http://vis-www.cs.umass.edu/lfw/
  • MAAD: https://github.com/pterhoer/MAAD-Face
  • CelebA-HQ: https://github.com/nperraud/download-celebA-HQ
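For convenience, a hedged loading snippet for CelebA is shown below; it relies on torchvision's built-in CelebA dataset class (a standard library API, not something provided by this paper), and LFWA and MAAD must be downloaded manually from the URLs above.

```python
# Convenience snippet (not from the paper): load CelebA with its 40 binary
# attribute labels via torchvision's built-in wrapper.
from torchvision import datasets, transforms

celeba = datasets.CelebA(
    root="data",                 # local download directory
    split="train",               # "train" | "valid" | "test"
    target_type="attr",          # return the 40-dim binary attribute vector
    transform=transforms.ToTensor(),
    download=True,               # fetched from Google Drive; may be rate-limited
)
image, attrs = celeba[0]         # attrs: LongTensor of shape (40,)
```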


Acknowledgements

This work was partly supported by the National Natural Science Foundation of China under Grants 62071404, U21A20514, and 61872307, by the Open Research Projects of Zhejiang Lab under Grant 2021KG0AB02, by the Natural Science Foundation of Fujian Province under Grant 2020J01001, and by the Youth Innovation Foundation of Xiamen City under Grant 3502Z20206046.

Author information


Corresponding author

Correspondence to Yan Yan.

Additional information

Communicated by Maja Pantic.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yan, Y., Shu, Y., Chen, S. et al. SPL-Net: Spatial-Semantic Patch Learning Network for Facial Attribute Recognition with Limited Labeled Data. Int J Comput Vis 131, 2097–2121 (2023). https://doi.org/10.1007/s11263-023-01787-w

