Abstract
Current state-of-the-art methods in multi-modal fusion typically map multi-modal features onto a newly generated shared representation space, with the goal of improving performance by combining the individual modalities. These heavily fine-tuned feature representations often have strong discriminability in their own spaces, which may be lost in the fused subspace because information from multiple sources is compressed. To address this, we propose a new approach to fusion that enhances the individual feature spaces through information exchange between the modalities. Essentially, domain adaptation is learnt by building a shared representation that is used to mutually enhance each domain’s knowledge. In particular, the learning objective is modeled to modify the features with the overarching goal of improving the combined system performance. We apply our fusion method to facial action unit (AU) recognition by learning to enhance the thermal and visible feature representations. We compare our approach with other recent fusion schemes and demonstrate its effectiveness on the MMSE dataset, where it outperforms previous techniques.
N. N. Lakshminarayana, D. D. Mohan, and N. Sankaran contributed equally; these authors are listed in alphabetical order.
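The abstract does not give architectural details, so the following is only a minimal sketch of the general idea it describes: each modality keeps its own feature space and is adjusted using a shared representation built from both modalities. The layer shapes, the tanh nonlinearity, the residual form of the update, the random weights, and all names (`enhance`, `params`, `t2s`, `s2t`, and so on) are assumptions made for illustration and are not taken from the chapter.

```python
import math
import random


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def enhance(thermal, visible, params):
    """Return enhanced (thermal, visible) features with unchanged dimensionality."""
    # Shared representation built from both modalities
    # (assumed form: sum of two linear projections followed by tanh).
    shared = [
        math.tanh(a + b)
        for a, b in zip(linear(params["t2s"], thermal), linear(params["v2s"], visible))
    ]
    # Each modality stays in its own space and receives a residual correction
    # conditioned on the shared representation (assumed update rule).
    thermal_out = [x + d for x, d in zip(thermal, linear(params["s2t"], shared))]
    visible_out = [x + d for x, d in zip(visible, linear(params["s2v"], shared))]
    return thermal_out, visible_out


rng = random.Random(0)
dim_t, dim_v, dim_s = 6, 8, 4  # hypothetical feature sizes
params = {
    "t2s": random_matrix(dim_s, dim_t, rng),  # thermal -> shared
    "v2s": random_matrix(dim_s, dim_v, rng),  # visible -> shared
    "s2t": random_matrix(dim_t, dim_s, rng),  # shared -> thermal correction
    "s2v": random_matrix(dim_v, dim_s, rng),  # shared -> visible correction
}
thermal = [rng.random() for _ in range(dim_t)]
visible = [rng.random() for _ in range(dim_v)]
thermal_enh, visible_enh = enhance(thermal, visible, params)
```

In contrast to mapping both modalities into a single fused vector, the two outputs here keep the dimensionality of their inputs, so downstream per-modality classifiers could in principle be reused. In a trained system the weights would be learned against the combined recognition objective rather than drawn at random.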
Acknowledgements
This material is based upon work partially supported by the National Science Foundation under Grant IIP #1266183.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Lakshminarayana, N.N., Mohan, D.D., Sankaran, N., Setlur, S., Govindaraju, V. (2020). Multi-modal Conditional Feature Enhancement for Facial Action Unit Recognition. In: Singh, R., Vatsa, M., Patel, V., Ratha, N. (eds) Domain Adaptation for Visual Understanding. Springer, Cham. https://doi.org/10.1007/978-3-030-30671-7_7
DOI: https://doi.org/10.1007/978-3-030-30671-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30670-0
Online ISBN: 978-3-030-30671-7
eBook Packages: Computer Science, Computer Science (R0)