ABSTRACT
Facial Expression Recognition (FER) is a fundamental computer vision task: classifying human face images into emotion categories such as happy, sad, surprised, scared, and angry. Deep learning has recently driven great progress in FER. However, whether they rely on weight initialization techniques or attention mechanisms, deep-learning-based FER methods struggle to capture features that are visually insignificant but semantically important. To address this problem, we present a novel FER training strategy consisting of two components: Memo Affinity Loss (MAL) and Mask Attention Fine Tuning (MAFT). MAL is a variant of center loss that uses a memory bank strategy together with discriminative centers. It widens the distance between different clusters and narrows the distance within each cluster, so the features extracted by the CNN are comprehensive and independent, which yields a more robust model. MAFT is a strategy that temporarily blindfolds the attended parts of the input image and forces the model to learn from other important regions. It is not only an augmentation technique but also a novel fine-tuning approach. To the best of our knowledge, we are the first to apply a mask strategy to the attended regions and to use it to fine-tune models. Finally, to implement our ideas, we construct a new network named Architecture Attention ResNet, based on ResNet-18. Our methods are conceptually and practically simple, yet achieve superior results on popular public FER benchmarks: 88.75% on RAF-DB, 65.17% on AffectNet-7, and 60.72% on AffectNet-8. The code will be open-sourced soon.
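The two ideas in the abstract can be illustrated with a minimal, framework-free Python sketch. It is not the authors' implementation: the abstract does not give the exact form of MAL or MAFT, so the function names, the `ratio` and `fill` parameters, and the ratio-style loss below are all assumptions chosen for illustration, and the memory bank component of MAL is omitted entirely.

```python
# Illustrative sketch only; names, parameters, and formulas are assumptions,
# not the method described in the paper.

def mask_top_attention(image, attention, ratio=0.25, fill=0.0):
    """Return a copy of `image` with its highest-attention cells set to `fill`.

    `image` and `attention` are equally sized 2D lists (rows of floats).
    `ratio` is a hypothetical fraction of cells to blindfold.
    """
    height, width = len(image), len(image[0])
    # Rank every cell by attention score, highest first.
    cells = sorted(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: attention[rc[0]][rc[1]],
        reverse=True,
    )
    num_masked = int(ratio * height * width)
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in cells[:num_masked]:
        masked[r][c] = fill
    return masked


def center_affinity_loss(features, labels, centers, eps=1e-8):
    """Center-loss-style objective: small when clusters are tight and far apart.

    Computes the mean squared distance of each feature to its class center,
    divided by the variance of the class centers around their mean.
    """
    dim = len(centers[0])
    # Intra-class term: pulls each feature toward its own class center.
    intra = sum(
        sum((f[d] - centers[y][d]) ** 2 for d in range(dim))
        for f, y in zip(features, labels)
    ) / len(features)
    # Inter-class term: how spread out the class centers are.
    mean_center = [sum(c[d] for c in centers) / len(centers) for d in range(dim)]
    spread = sum(
        sum((c[d] - mean_center[d]) ** 2 for d in range(dim)) for c in centers
    ) / len(centers)
    return intra / (spread + eps)


if __name__ == "__main__":
    image = [[1.0, 2.0], [3.0, 4.0]]
    attention = [[0.9, 0.1], [0.2, 0.3]]
    print(mask_top_attention(image, attention, ratio=0.25))

    features = [[1.0, 0.0], [2.0, 0.0]]
    centers = [[0.0, 0.0], [2.0, 0.0]]
    print(center_affinity_loss(features, [0, 1], centers))
```

In a fine-tuning loop of the kind the abstract describes, the masked image would presumably be fed back to the model so that it must rely on regions outside its current attention; how the attention map is obtained and how much is masked are details the abstract leaves open.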
Index Terms
- Blindfold Attention: Novel Mask Strategy for Facial Expression Recognition