ABSTRACT
Multimodal facial action unit (AU) recognition aims to build models that can process, correlate, and integrate information from multiple modalities (e.g., 2D images from a visual sensor, 3D geometry from 3D imaging, and thermal images from an infrared sensor). Although multimodal data provide rich information, two challenges must be addressed when learning from them: 1) the model must capture the complex cross-modal interactions in order to exploit the additional and mutual information effectively; 2) the model must remain robust to unexpected data corruption at test time, such as a modality being missing or noisy. In this paper, we propose a novel Adaptive Multimodal Fusion (AMF) method for AU detection, which learns to select the most relevant feature representations from different modalities through a re-sampling procedure conditioned on a feature scoring module. The feature scoring module evaluates the quality of the features learned from each modality. As a result, AMF adaptively selects the more discriminative features, which increases its robustness to missing or corrupted modalities. In addition, to alleviate over-fitting and improve generalization to the test data, we design a cut-switch multimodal data augmentation method, in which a random block is cut out and switched across modalities. We conduct a thorough investigation on two public multimodal AU datasets, BP4D and BP4D+, and the results demonstrate the effectiveness of the proposed method. Ablation studies under various conditions further show that our method remains robust to missing or noisy modalities at test time.
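The abstract does not spell out the exact cut-switch procedure. As a minimal sketch only (the function name, the patch-size range, and the assumption that a single rectangular block is exchanged between two spatially aligned modality images of the same sample are all illustrative choices, not the paper's specification), the augmentation could look like:

```python
import numpy as np

def cut_switch(mod_a, mod_b, rng=None):
    """Swap one random rectangular block between two aligned modality images.

    mod_a, mod_b: arrays of identical shape (H, W, C), spatially aligned
    (e.g., a 2D texture image and a thermal image of the same face).
    Returns copies of both inputs with the block exchanged.
    """
    assert mod_a.shape == mod_b.shape, "modalities must be aligned and same shape"
    rng = rng or np.random.default_rng()
    h, w = mod_a.shape[:2]
    # Block size up to half of each spatial dimension (assumed hyperparameter).
    ph = rng.integers(1, h // 2 + 1)
    pw = rng.integers(1, w // 2 + 1)
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    out_a, out_b = mod_a.copy(), mod_b.copy()
    # Exchange the same spatial block across the two modalities.
    out_a[top:top + ph, left:left + pw] = mod_b[top:top + ph, left:left + pw]
    out_b[top:top + ph, left:left + pw] = mod_a[top:top + ph, left:left + pw]
    return out_a, out_b
```

Because the block is switched rather than zeroed out (as in Cutout) or mixed within one modality (as in CutMix), each augmented sample still contains valid facial content in every region, but from a different sensor, which plausibly encourages the fusion network to tolerate locally corrupted modalities.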
Index Terms
- Adaptive Multimodal Fusion for Facial Action Units Recognition