ABSTRACT
Multimodal facial action unit (AU) recognition aims to build models that can process, correlate, and integrate information from multiple modalities (e.g., 2D images from a visual sensor, 3D geometry from 3D imaging, and thermal images from an infrared sensor). Although multimodal data provide rich information, two challenges must be addressed when learning from them: 1) the model must capture the complex cross-modal interactions in order to exploit the additional and mutual information effectively; 2) the model must remain robust to unexpected data corruption at test time, such as a modality being missing or noisy. In this paper, we propose a novel Adaptive Multimodal Fusion (AMF) method for AU detection, which learns to select the most relevant feature representations from different modalities through a re-sampling procedure conditioned on a feature scoring module. The feature scoring module evaluates the quality of the features learned from each modality. As a result, AMF adaptively selects the more discriminative features, which increases its robustness to missing or corrupted modalities. In addition, to alleviate over-fitting and improve generalization to the test data, we design a cut-switch multimodal data augmentation method, in which a random block is cut out and switched across modalities. We conduct a thorough investigation on two public multimodal AU datasets, BP4D and BP4D+, and the results demonstrate the effectiveness of the proposed method. Ablation studies under various conditions further show that our method remains robust to missing or noisy modalities at test time.
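The abstract does not spell out the exact cut-switch procedure. As a minimal sketch only (the function name, the patch-size range, and the assumption that a single rectangular block is exchanged between two spatially aligned modality images of the same sample are all illustrative choices, not the paper's specification), the augmentation could look like:

```python
import numpy as np

def cut_switch(mod_a, mod_b, rng=None):
    """Swap one random rectangular block between two aligned modality images.

    mod_a, mod_b: arrays of identical shape (H, W, C), spatially aligned
    (e.g., a 2D texture image and a thermal image of the same face).
    Returns copies of both inputs with the block exchanged.
    """
    assert mod_a.shape == mod_b.shape, "modalities must be aligned and same shape"
    rng = rng or np.random.default_rng()
    h, w = mod_a.shape[:2]
    # Block size up to half of each spatial dimension (assumed hyperparameter).
    ph = rng.integers(1, h // 2 + 1)
    pw = rng.integers(1, w // 2 + 1)
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    out_a, out_b = mod_a.copy(), mod_b.copy()
    # Exchange the same spatial block across the two modalities.
    out_a[top:top + ph, left:left + pw] = mod_b[top:top + ph, left:left + pw]
    out_b[top:top + ph, left:left + pw] = mod_a[top:top + ph, left:left + pw]
    return out_a, out_b
```

Because the block is switched rather than zeroed out (as in Cutout) or mixed within one modality (as in CutMix), each augmented sample still contains valid facial content in every region, but from a different sensor, which plausibly encourages the fusion network to tolerate locally corrupted modalities.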
Index Terms
- Adaptive Multimodal Fusion for Facial Action Units Recognition