Abstract
Self-attention enhances features by aggregating information from similar features. In face alignment, however, the attention also covers non-face areas, so it can be disturbed in challenging cases such as occlusions and then fail to predict landmarks. In addition, the variance of the learned feature similarity is not large enough in the experiments. To address these issues, we propose structural dependence learning based on self-attention for face alignment (SSFA). It limits self-attention learning to the facial range and adaptively builds the dependencies among significant landmark structures. Compared with other state-of-the-art methods, SSFA improves performance on several standard facial landmark detection benchmarks and adapts better to challenging cases.
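The idea of limiting self-attention to the facial range can be illustrated with a generic masked self-attention. The sketch below is not the SSFA architecture. It assumes identity query/key/value projections, a residual connection, and a hypothetical boolean `face_mask` that marks which positions lie on the face; none of these details come from the paper.

```python
import math

def masked_self_attention(features, face_mask):
    """Generic self-attention whose keys are restricted to masked positions.

    features:  list of N feature vectors (lists of floats).
    face_mask: list of N booleans, True where a position lies on the face
               (hypothetical input, assumed for illustration).
    Returns N enhanced vectors: input plus the attention-weighted aggregate.
    """
    n = len(features)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    keys = [j for j in range(n) if face_mask[j]]
    if not keys:
        # Nothing to attend to: return the features unchanged.
        return [list(f) for f in features]

    enhanced = []
    for i in range(n):
        # Scaled dot-product similarity to in-mask positions only.
        scores = [
            scale * sum(a * b for a, b in zip(features[i], features[j]))
            for j in keys
        ]
        # Numerically stable softmax over the in-mask positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Aggregate in-mask features and add them back to the input.
        aggregate = [
            sum(w * features[j][d] for w, j in zip(weights, keys))
            for d in range(dim)
        ]
        enhanced.append([x + a for x, a in zip(features[i], aggregate)])
    return enhanced
```

With this restriction, a feature outside the mask (for example, an occluder next to the face) cannot influence the enhanced features at facial positions, which is the kind of robustness the abstract describes.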
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2021YFE0205700), the National Natural Science Foundation of China (Nos. 62076235, 62276260 and 62002356), the Zhejiang Lab (No. 2021KH0AB07) and the Ministry of Education Industry-University Cooperative Education Program (Wei Qiao Venture Group, No. E1425201).
Ethics declarations
The authors declare that they have no conflicts of interest related to this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Biying Li received the B. Eng. degree in automation from Xi’an Jiaotong University, China in 2018. She is currently a Ph.D. degree candidate in the Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, China.
Her research interests include 3D face and human understanding, image and video processing, and pattern recognition.
Zhiwei Liu received the B. Sc. degree in software engineering from Sichuan University, China in 2015, and the Ph.D. degree in pattern recognition and intelligent system from the Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, China in 2020. Currently, he is an assistant professor. He has published several papers in CVPR, AAAI, ACMMM, ECCV, TMM, and TOMM. He is participating in several national projects, including the National Natural Science Foundation of China.
His research interests include 3D face and human understanding, virtual human generation and control, and human-centric AI-generated content.
Wei Zhou received the B. Eng. degree in software engineering from the Beijing Institute of Technology, China in 2007, and the M. Sc. degree in software engineering from Peking University, China in 2010. He is currently a Ph.D. degree candidate at Tsinghua University, China. He serves as Chief Investment Officer of Wuhan Artificial Intelligence Research Institute.
His research interest is intelligent decisions driven by multimodal heterogeneous data.
Haiyun Guo received the B. Sc. degree in electronic information science and technology from Wuhan University, China in 2013, and the Ph.D. degree in pattern recognition and intelligent systems from the University of Chinese Academy of Sciences, China in 2018. Currently, she is an associate research fellow at the Institute of Automation, Chinese Academy of Sciences, China.
Her research interests include image and video analysis, multimodal understanding, large-scale model training, and general model design.
Xin Wen received the B. Eng. degree in communication engineering from Chongqing University of Posts and Telecommunications, China in 2016, and the M. Sc. degree in computer technology from the University of Chinese Academy of Sciences, China in 2021. She is currently a Ph.D. degree candidate at the National University of Defense Technology, China.
Her research interests include image processing, pattern recognition and 3D reconstruction.
Min Huang received the B. Sc. and Ph.D. degrees in computer sciences and technology from Wuhan University, China in 2002 and 2007, respectively. From 2017 to 2018, she was a visiting scholar with the School of Informatics at the University of Edinburgh, UK. She is currently an associate professor at the School of Artificial Intelligence, University of Chinese Academy of Sciences, China.
Her research interests include machine learning, knowledge engineering and pattern recognition.
Jinqiao Wang received the B. Eng. degree in mechanical and electronic engineering from Hebei University of Technology, China in 2001, and the M. Sc. degree in mechanical and electronic engineering from Tianjin University, China in 2004. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, China in 2008. He is currently a professor at the Chinese Academy of Sciences, China.
His research interests include pattern recognition and machine learning, image and video processing, mobile multi-media, and intelligent video surveillance.
Cite this article
Li, B., Liu, Z., Zhou, W. et al. Structural Dependence Learning Based on Self-attention for Face Alignment. Mach. Intell. Res. 21, 514–525 (2024). https://doi.org/10.1007/s11633-023-1465-1
DOI: https://doi.org/10.1007/s11633-023-1465-1