Pattern Recognition

Volume 100, April 2020, 107155
BLAN: Bi-directional ladder attentive network for facial attribute prediction

https://doi.org/10.1016/j.patcog.2019.107155

Highlights

  • A novel Bi-directional Ladder Attentive Network (BLAN) is proposed to improve facial attribute prediction.

  • Hierarchical representations are learned to exploit the correlations between feature hierarchies and attribute characteristics.

  • The Residual Dual Attention Module (RDAM) effectively interweaves features from the encoder and the decoder.

  • A Local Mutual Information Maximization (LMIM) loss further incorporates the locality of the input attribute features into the high-level representations, producing higher-quality features.

  • An adaptive score fusion module merges the multiple global and local decisions from all hierarchies.

Abstract

Deep facial attribute prediction has received considerable attention in the past few years, with a wide range of real-world applications. Existing works mostly extract abstract global features at high levels of deep neural networks to make predictions. However, local features at low levels, which contain detailed local attribute information, are not well exploited. In this paper, we propose a novel Bi-directional Ladder Attentive Network (BLAN) to learn hierarchical representations that capture the correlations between feature hierarchies and attribute characteristics. BLAN adopts layer-wise bi-directional connections, based on the autoencoder framework, from low to high levels. In this way, hierarchical features with local and global attribute characteristics can be correspondingly interweaved at each level via multiple designed Residual Dual Attention Modules (RDAMs). In addition, we derive a Local Mutual Information Maximization (LMIM) loss to further incorporate the locality of facial attributes into the high-level representations at each hierarchy. Multiple attribute classifiers receive the hierarchical representations and produce local and global decisions, which a proposed adaptive score fusion module then merges to yield the final prediction. Extensive experiments on two facial attribute datasets, CelebA and LFWA, demonstrate that our BLAN outperforms state-of-the-art methods.

Introduction

Facial attributes represent intuitive semantic features that describe visual properties of face images [1], [2], such as smiling and eyeglasses, contributing to numerous real-world applications, e.g., face verification [3], [4], face recognition [5], [6], and face retrieval [7], [8]. Given a face image, facial attribute prediction aims to estimate whether desired attributes are present by learning discriminative feature representations and constructing accurate attribute classifiers.

Recently, deep convolutional neural networks (CNNs) have gained great popularity and have dramatically improved the performance of state-of-the-art algorithms in the field of facial attribute prediction. In general, deep facial attribute prediction methods can be categorized into two groups: part-based methods [9], [10] and holistic methods [11], [12]. Part-based methods first locate the positions of facial attributes and then extract features according to the obtained location cues for the subsequent attribute prediction. In contrast, holistic methods learn attribute relationships and estimate facial attributes from entire face images without any additional localization mechanism.

In this paper, we focus on holistic facial attribute prediction methods. The insight in this line of work lies in capturing shared and specific attribute features with customized architectures. Specifically, the customized networks learn features shared by all attributes across low-level layers. These features then flow to high-level layers, which resort to multiple split branches to predict attributes with different characteristics. However, in this process, only the high-level abstract features at the end of each branch take part in the final attribute prediction. The shared information at the low-level layers might vanish by the time it reaches the high-level layers [12]. Consequently, low-level features may not be fully explored and utilized.
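
To make the shared-trunk design concrete, the sketch below shows a generic holistic attribute network of this kind in PyTorch; the layer sizes, module names, and the 40-attribute output are illustrative assumptions rather than the configuration of any particular cited method. Note that only the outputs of the high-level branches reach the classifiers, which is exactly the limitation discussed above.

    import torch
    import torch.nn as nn

    class SharedTrunkBranches(nn.Module):
        """Generic holistic attribute predictor: shared trunk + per-attribute branches."""
        def __init__(self, num_attributes=40):
            super().__init__()
            # Shared low-level layers that learn features common to all attributes.
            self.trunk = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # One high-level branch per attribute; only these outputs are classified.
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
                )
                for _ in range(num_attributes)
            ])

        def forward(self, x):
            shared = self.trunk(x)                       # low-level shared features
            logits = [b(shared) for b in self.branches]  # high-level, attribute-specific
            return torch.cat(logits, dim=1)              # (batch, num_attributes)

    model = SharedTrunkBranches()
    print(model(torch.randn(2, 3, 128, 128)).shape)      # torch.Size([2, 40])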

Such deficiencies of current holistic facial attribute methods prompt us to reconsider the relationship between the CNN architecture and the features it extracts at each level. Rather than capturing features with commonality and specialty in deep networks, this paper leverages the hierarchical structure of a deep network to learn the locality and globality of facial attribute features. Specifically, low-level CNN layers capture subtle and detailed face features, corresponding to the attributes that appear in local face regions, i.e., local facial attributes. As CNNs go deeper, more global and abstract information is explored to estimate the attributes that rely on the entire face, i.e., global facial attributes. Therefore, the local and global natures of facial attributes map naturally onto the local and global feature representations captured by the low-level and high-level hierarchies of deep networks.

Exploiting such correlations between feature hierarchies and attribute characteristics, we design a novel Bi-directional Ladder Attentive Network (BLAN) to learn hierarchical feature representations from low levels to high levels and, correspondingly, to predict facial attributes with locality and globality. BLAN is constructed on the autoencoder framework with multiple layer-wise bi-directional connections between its encoder and decoder. The encoder and decoder features learned at each level are fed into the proposed Residual Dual Attention Module (RDAM). RDAM adaptively interweaves these features to learn complementary information via residual connections. In addition, it employs dual channel-wise and spatial-wise attention to jointly learn what and where to focus, yielding richer attentive feature representations. To further improve the quality of the interweaved representations at each level, a Local Mutual Information Maximization (LMIM) loss is derived to incorporate the locality of input attributes into the high-level representations. Multiple hierarchical classifiers then operate on the learned hierarchical attentive features with maximized mutual information to produce global and local decisions, and an adaptive score fusion module merges these decisions from each level of BLAN, further boosting the final performance. Extensive experiments on two facial attribute datasets, CelebA and LFWA, demonstrate that the proposed method outperforms state-of-the-art methods.
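
As a rough illustration of the data flow just described, the following PyTorch sketch wires a three-level encoder and decoder together with a placeholder fusion module at each level, per-level attribute classifiers, and softmax-weighted adaptive score fusion. All channel widths, layer configurations, and class names (FusePlaceholder, BLANSketch) are assumptions made for illustration and do not reproduce the paper's exact implementation.

    import torch
    import torch.nn as nn

    class FusePlaceholder(nn.Module):
        """Stand-in for RDAM: interweaves encoder/decoder features with a residual link."""
        def __init__(self, channels):
            super().__init__()
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, enc_feat, dec_feat):
            return enc_feat + self.fuse(torch.cat([enc_feat, dec_feat], dim=1))

    class BLANSketch(nn.Module):
        def __init__(self, num_attributes=40):
            super().__init__()
            chs = [32, 64, 128]
            # Encoder: low-to-high levels, each halving the spatial resolution.
            self.enc = nn.ModuleList([
                nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1), nn.ReLU())
                for c_in, c_out in zip([3] + chs[:-1], chs)
            ])
            # Decoder: runs from the deepest level back toward the input resolution.
            self.dec = nn.ModuleList([
                nn.Sequential(nn.Conv2d(128, 128, 3, 1, 1), nn.ReLU()),          # level 2
                nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU()),  # level 1
                nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU()),   # level 0
            ])
            self.fuse = nn.ModuleList([FusePlaceholder(c) for c in chs])
            # One classifier head per hierarchy (local to global decisions).
            self.heads = nn.ModuleList([
                nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                              nn.Linear(c, num_attributes))
                for c in chs
            ])
            # Adaptive score fusion: learnable weights over the hierarchies.
            self.fusion_logits = nn.Parameter(torch.zeros(len(chs)))

        def forward(self, x):
            enc_feats = []
            for block in self.enc:            # low -> high level
                x = block(x)
                enc_feats.append(x)
            dec_feats = [None] * len(enc_feats)
            d = enc_feats[-1]
            for level, block in zip(reversed(range(len(enc_feats))), self.dec):
                d = block(d)                  # high -> low level
                dec_feats[level] = d
            scores = [head(self.fuse[i](enc_feats[i], dec_feats[i]))
                      for i, head in enumerate(self.heads)]
            weights = torch.softmax(self.fusion_logits, dim=0)
            return sum(w * s for w, s in zip(weights, scores))  # (batch, num_attributes)

    model = BLANSketch()
    print(model(torch.randn(2, 3, 128, 128)).shape)              # torch.Size([2, 40])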

The main contributions are summarized as follows.

  • We propose a novel Bi-directional Ladder Attentive Network (BLAN), which exploits the correlations between low-to-high hierarchy features and local-to-global facial attributes. Layer-wise bi-directional connections are designed based on the autoencoder framework to learn complementary features from the encoder and the decoder.

  • A Residual Dual Attention Module (RDAM) is developed to jointly learn dual channel-wise and spatial-wise attention for interweaving the encoder and decoder features. The residual connection ensures that complementary information is captured; a hedged sketch of such a module is given after this list.

  • A Local Mutual Information Maximization (LMIM) loss is introduced to maximize the deep mutual information between the input attentive attribute features and the learned abstract representations, yielding improved features at each hierarchy (see the illustrative sketch after this list).

  • We present an adaptive score fusion strategy that merges local and global decisions from multiple hierarchical attribute classifiers to further boost the performance of facial attribute prediction. Superior experimental results on two facial attribute datasets, CelebA and LFWA, demonstrate the effectiveness of the proposed BLAN.
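
The sketch below illustrates one plausible realization of the RDAM described above, assuming a squeeze-and-excitation-style channel attention followed by a convolutional spatial attention over the fused encoder and decoder features, with a residual connection from the encoder path; the paper's exact RDAM layout may differ.

    import torch
    import torch.nn as nn

    class RDAMSketch(nn.Module):
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
            # Channel-wise attention: "what" to focus on.
            self.channel_att = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
            )
            # Spatial-wise attention: "where" to focus.
            self.spatial_att = nn.Sequential(
                nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
            )

        def forward(self, enc_feat, dec_feat):
            fused = self.fuse(torch.cat([enc_feat, dec_feat], dim=1))
            fused = fused * self.channel_att(fused)
            pooled = torch.cat([fused.mean(dim=1, keepdim=True),
                                fused.amax(dim=1, keepdim=True)], dim=1)
            fused = fused * self.spatial_att(pooled)
            # Residual connection preserves complementary encoder information.
            return enc_feat + fused

    rdam = RDAMSketch(channels=64)
    enc = torch.randn(2, 64, 32, 32)
    dec = torch.randn(2, 64, 32, 32)
    print(rdam(enc, dec).shape)  # torch.Size([2, 64, 32, 32])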
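
For the LMIM loss, the sketch below shows one common way to maximize local mutual information, in the spirit of Deep InfoMax: every spatial location of a local feature map is scored against the global representation of the same hierarchy, with in-batch shuffled pairs as negatives under a Jensen-Shannon-style bound. The scoring network and loss form are assumptions; the paper's exact LMIM formulation is not reproduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalMILoss(nn.Module):
        def __init__(self, local_channels, global_dim):
            super().__init__()
            # A 1x1 conv scores each (local feature, broadcast global vector) pair.
            self.score = nn.Conv2d(local_channels + global_dim, 1, kernel_size=1)

        def forward(self, local_feat, global_feat):
            # local_feat:  (B, C, H, W) attentive features at one hierarchy.
            # global_feat: (B, D) high-level representation at the same hierarchy.
            b, _, h, w = local_feat.shape
            g = global_feat[:, :, None, None].expand(-1, -1, h, w)
            pos = self.score(torch.cat([local_feat, g], dim=1))                  # matched pairs
            neg = self.score(torch.cat([local_feat.roll(1, dims=0), g], dim=1))  # mismatched pairs
            # Jensen-Shannon-style lower bound on mutual information (to be
            # maximized), returned as a loss to be minimized.
            return F.softplus(-pos).mean() + F.softplus(neg).mean()

    criterion = LocalMILoss(local_channels=64, global_dim=128)
    loss = criterion(torch.randn(2, 64, 32, 32), torch.randn(2, 128))
    print(loss.item())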

Section snippets

Facial attribute prediction

Existing deep facial attribute prediction works can be generally grouped into two broad categories: part-based methods and holistic methods. We introduce each of the two categories in detail below.

Part-based methods extract feature representations from the different positions of facial attributes. Each position corresponds to a single attribute classifier. Hence, the key to part-based methods lies in the localization mechanism, which further classifies part-based methods

Bi-directional ladder attentive network

Given facial attribute images, the proposed BLAN first learns hierarchical feature representations from low-level to high-level layers under the autoencoder framework, corresponding to local and global features that reflect the locality and globality of facial attributes. Then, the representations learned by both the encoder and the decoder at different hierarchies are fed into multiple residual dual attention modules to interweave more discriminative attentive features. Next, these

Experiments

In this section, we systematically conduct experiments on two facial attribute datasets: CelebA and LFWA [17]. First, we describe the datasets and their test protocols. Second, we provide the implementation details, covering training schemes, hyperparameter configurations, and attention settings. Third, we compare and discuss our BLAN against state-of-the-art methods. Then, we experimentally illustrate the effectiveness of the hierarchical features learned by BLAN. Finally, the in-depth analysis

Conclusion and future works

In this paper, we study the facial attribute prediction problem by exploiting the correlations between hierarchical features and attributes with locality and globality characteristics. We have proposed a novel Bi-directional Ladder Attentive Network (BLAN) to learn hierarchical representations at different levels of an autoencoder framework. Layer-wise bi-directional connections between the encoder and the decoder ensure that richer local and global attribute representations are captured by

Acknowledgements

This work is supported in part by the State Key Development Program (Grant No. 2016YFB1001001), in part by the National Natural Science Foundation of China (NSFC) under Grant U1736119 and Grant U1936117, and in part by the Fundamental Research Funds for the Central Universities under Grant DUT18JC06.

References (42)

  • N. Zhang et al.

    PANDA: pose aligned networks for deep attribute modeling

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2014)
  • J. Li et al.

    Landmark free face attribute prediction

    IEEE Trans. Image Process.

    (2018)
  • E.M. Hand et al.

    Attributes for improved attributes: a multi-task network utilizing implicit and explicit relationships for facial attribute classification

    Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI)

    (2017)
  • J. Cao et al.

    Partially shared multi-task convolutional neural network with local constraint for face attribute learning

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • L. Bourdev et al.

    Poselets: body part detectors trained using 3D human pose annotations

    Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    (2009)
  • M.M. Kalayeh et al.

    Improving facial attribute prediction using semantic segmentation

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • U. Mahbub et al.

    Segment-based methods for facial attribute detection from partial faces

    IEEE Trans. Affective Comput.

    (2018)
  • Z. Liu et al.

    Deep learning face attributes in the wild

    Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    (2015)
  • H. Ding et al.

    A deep cascade network for unaligned face attribute classification

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

    (2018)
  • E.M. Rudd et al.

    MOON: a mixed objective optimization network for the recognition of facial attributes

    Proceedings of the European Conference on Computer Vision (ECCV)

    (2016)
  • O.M. Parkhi et al.

    Deep face recognition

    Proceedings of the British Machine Vision Conference (BMVC)

    (2015)

Xin Zheng received the B.E. degree in Integrated Circuit Design and Integration System, Dalian University of Technology, in 2017. She is currently a Master Student in the School of Information and Communication Engineering, Dalian University of Technology. Her research interests are in computer vision and pattern recognition.

Huaibo Huang received the B.E. degree in Measurement and Control Technology and Instrument from Xi’an Jiaotong University in 2012, and the M.E. degree in Optical Engineering from Beihang University in 2016. He is currently a Ph.D. student in the Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), CASIA, Beijing, China. His current research interests include computer vision and pattern recognition.

Yanqing Guo received the B.S. degree and Ph.D. degree in Electronic Engineering from Dalian University of Technology of China, in 2002 and 2009, respectively. He is currently a professor with School of Information and Communication Engineering, Dalian University of Technology. His research interests include multimedia security and forensics, digital image processing, deep learning and machine learning.

Bo Wang received the Ph.D. degree from Dalian University of Technology, China, in 2010. He is currently an Associate Professor with the School of Information and Communication Engineering, Dalian University of Technology. His research interests include image forensics and image steganalysis.

Ran He received the BE and MS degrees in computer science from Dalian University of Technology, and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2001, 2004, and 2009, respectively. Since September 2010, he has been with the National Laboratory of Pattern Recognition, where he is currently an associate professor. He currently serves as an associate editor of Neurocomputing (Elsevier) and serves on the program committees of several conferences. His research interests include information theoretic learning, pattern recognition, and computer vision.
