Abstract
Facial action unit (AU) detection has been applied in a wide range of fields and has attracted great attention over the last decades. Most existing methods employ predefined regions of interest with the same number and range for all samples. However, we find that the flexibility of predefined regions of interest is limited, as different AUs may not occur simultaneously and their ranges change as intensity changes. In addition, many AU detection works design feature extraction modules and classifiers independently for each AU, which incurs high computation cost and ignores the dependencies among different AUs. In view of the limited flexibility of predefined regions of interest, we propose difference saliency maps that do not depend on facial landmarks. They are spatial pixel-wise attention maps in which each element represents the importance of the corresponding pixel in the entire image; consequently, all regions of interest can be irregular. To address the high computation cost, we combine group convolution with skip connections to build a lightweight network better suited to AU detection. All AUs share features and there is only one classifier, so the computation cost and the number of parameters are greatly reduced. In particular, the difference saliency maps and the global feature maps are combined to obtain regional enhancement features. To maximize the enhancement effect, the down-sampled difference saliency maps are added to multiple blocks of the lightweight network, and the enhanced global features are sent directly to the classifier for AU detection. By changing the number of neurons in the classifier, our framework can easily adapt to different datasets. Extensive experimental results show that the proposed framework substantially outperforms classic deep learning methods when evaluated on the DISFA+ and CK+ datasets.
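The enhancement step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: it assumes the difference saliency map is a normalized absolute difference against a neutral reference frame, that down-sampling is average pooling, and that the map acts as a residual (multiplicative) attention on the feature maps. All array shapes and function names are hypothetical.

```python
import numpy as np

def difference_saliency(expression, neutral):
    """Pixel-wise saliency as the normalized absolute difference between an
    expression frame and a neutral reference frame (illustrative assumption)."""
    diff = np.abs(expression.astype(np.float32) - neutral.astype(np.float32))
    return diff / (diff.max() + 1e-8)  # scale to [0, 1]

def downsample(saliency, factor):
    """Average-pool the saliency map so it matches a block's spatial size."""
    h, w = saliency.shape
    s = saliency[:h - h % factor, :w - w % factor]
    return s.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def enhance(features, saliency):
    """Residual attention: keep the global features, amplify salient regions.
    features: (C, H, W); saliency: (H, W) broadcast over channels."""
    return features * (1.0 + saliency[None, :, :])

# Toy usage: an 8x8 face with activity in one small region (e.g. a brow raise).
neutral = np.zeros((8, 8))
expression = np.zeros((8, 8))
expression[2:4, 2:4] = 1.0
sal = difference_saliency(expression, neutral)

# Down-sampled map injected into a deeper 4x4 block with 4 channels.
feats = np.ones((4, 4, 4))
out = enhance(feats, downsample(sal, 2))
```

Because the attention is additive around 1 rather than a pure mask, non-salient regions keep their original feature values while salient ones are amplified, which matches the abstract's framing of "regional enhancement" on top of global features.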
With the difference saliency maps added, the detection results surpass those of state-of-the-art AU detection methods. Further experiments demonstrate that our network is more efficient in terms of parameter count, computational complexity, and inference time.
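The efficiency gain from group convolution can be made concrete with a small parameter-count calculation. The channel widths and group count below are hypothetical, chosen only to illustrate the mechanism; the paper's actual configuration may differ.

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weight count of a k x k convolution (bias omitted). Each of the
    `groups` groups connects c_in/groups inputs to c_out/groups outputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

standard = conv_params(128, 128)           # ordinary 3x3 convolution
grouped = conv_params(128, 128, groups=4)  # group convolution, 4 groups
print(standard, grouped, standard // grouped)
```

Splitting the channels into g groups divides both the parameter count and the multiply-accumulate cost of the layer by g, which is why combining group convolution with skip connections (to restore cross-group information flow) yields a lightweight backbone.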
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Cite this article
Chen, J., Wang, C., Wang, K. et al. Lightweight network architecture using difference saliency maps for facial action unit detection. Appl Intell 52, 6354–6375 (2022). https://doi.org/10.1007/s10489-021-02755-y