DOI: 10.1145/3394171.3413538
Research Article

Adaptive Multimodal Fusion for Facial Action Units Recognition

Published: 12 October 2020

ABSTRACT

Multimodal facial action unit (AU) recognition aims to build models that can process, correlate, and integrate information from multiple modalities (i.e., 2D images from a visual sensor, 3D geometry from 3D imaging, and thermal images from an infrared sensor). Although multimodal data provide rich information, two challenges must be addressed when learning from them: 1) the model must capture the complex cross-modal interactions in order to exploit the additional and mutual information effectively; 2) the model must remain robust to unexpected data corruption at test time, when a modality is missing or noisy. In this paper, we propose a novel Adaptive Multimodal Fusion (AMF) method for AU detection, which learns to select the most relevant feature representations from different modalities through a re-sampling procedure conditioned on a feature scoring module. The feature scoring module is designed to evaluate the quality of the features learned from each modality. As a result, AMF adaptively selects the more discriminative features, increasing robustness to missing or corrupted modalities. In addition, to alleviate over-fitting and improve generalization to test data, we design a cut-switch multimodal data augmentation method, in which a random block is cut out and switched across modalities. We conduct a thorough investigation on two public multimodal AU datasets, BP4D and BP4D+, and the results demonstrate the effectiveness of the proposed method. Ablation studies under various conditions further show that our method remains robust to missing or noisy modalities at test time.
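The abstract gives only a high-level description of the two components, so the following PyTorch-style sketches are illustrative reconstructions, not the authors' implementation. The first shows the general idea of score-conditioned fusion: each modality backbone is assumed to yield a fixed-length feature vector, a small scorer rates each vector's quality, and a Gumbel-softmax relaxation (an assumption; the paper's exact re-sampling procedure is not stated here) turns the scores into near-discrete selection weights before a multi-label AU classifier. The names ScoredFusion, feat_dim, and num_aus are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoredFusion(nn.Module):
    """Illustrative score-conditioned fusion head (not the authors' code).

    Each modality contributes one feature vector; a small scorer rates the
    quality of each vector, and a Gumbel-softmax over the scores produces
    near-discrete weights deciding which modality dominates the fused
    representation fed to the multi-label AU classifier.
    """

    def __init__(self, feat_dim: int, num_aus: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)            # per-modality quality score
        self.classifier = nn.Linear(feat_dim, num_aus)  # multi-label AU head

    def forward(self, feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # feats: (batch, num_modalities, feat_dim), one vector per modality.
        scores = self.scorer(feats).squeeze(-1)         # (batch, num_modalities)
        # Differentiable selection; whether AMF uses this relaxation is an assumption.
        weights = F.gumbel_softmax(scores, tau=tau, hard=False)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_dim)
        return torch.sigmoid(self.classifier(fused))        # per-AU probabilities
```

The second sketch illustrates one plausible reading of cut-switch augmentation: a random rectangular block is cut and exchanged between two modalities of the same sample, leaving the AU labels untouched. The Beta-distributed block size (borrowed from CutMix-style augmentation) and the assumption that all modality tensors share the same shape are ours, not the paper's.

```python
import torch

def cut_switch(modalities, beta: float = 1.0):
    """Illustrative cut-switch augmentation (details assumed, not from the paper).

    Cuts one random rectangular block and swaps it between two randomly chosen
    modalities of the same sample. Assumes all modality tensors are spatially
    aligned and share the same shape (batch, channels, H, W).
    """
    _, _, H, W = modalities[0].shape
    # Block area sampled from a Beta distribution, as in CutMix (an assumption).
    lam = torch.distributions.Beta(beta, beta).sample().item()
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    # Swap the block between two distinct modalities; AU labels stay unchanged
    # because both blocks come from the same subject and frame.
    i, j = torch.randperm(len(modalities))[:2].tolist()
    patch = modalities[i][:, :, y1:y2, x1:x2].clone()
    modalities[i][:, :, y1:y2, x1:x2] = modalities[j][:, :, y1:y2, x1:x2]
    modalities[j][:, :, y1:y2, x1:x2] = patch
    return modalities
```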


Supplemental Material

3394171.3413538.mp4 (mp4, 9.7 MB)



Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

