ABSTRACT
Whilst deep learning techniques have achieved excellent emotion prediction, they still require large amounts of labelled training data, which are (a) onerous and tedious to compile, and (b) prone to errors and biases. We propose Multi-Task Contrastive Learning for Affect Representation (MT-CLAR) for few-shot affect inference. MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer, from a pair of expressive facial images, (a) the (dis)similarity between the facial expressions, and (b) the difference in valence and arousal levels of the two faces. We further extend the image-based MT-CLAR framework to automated video labelling: given one or a few labelled video frames (termed the support set), MT-CLAR labels the remainder of the video for valence and arousal. Experiments are performed on the AFEW-VA dataset with multiple support-set configurations; in addition, supervised learning on representations learnt via MT-CLAR is used for valence, arousal and categorical emotion prediction on the AffectNet and AFEW-VA datasets. The results show that valence and arousal predictions via MT-CLAR are comparable to the state-of-the-art (SOTA), and that MT-CLAR significantly outperforms the SOTA with a support set roughly 6% the size of the video dataset.
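The support-set labelling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding model, the similarity measure (cosine) and the k-nearest-neighbour label aggregation are all assumptions made here purely for illustration, standing in for the learnt MT-CLAR representation and its similarity inference.

```python
import numpy as np

def label_frames(support_emb, support_va, query_emb, k=3):
    """Assign valence/arousal labels to query frames by averaging the
    labels of the k most similar support-set embeddings.

    support_emb: (n_support, d) embeddings of the labelled frames
    support_va:  (n_support, 2) valence/arousal labels of those frames
    query_emb:   (n_query, d) embeddings of the unlabelled frames
    """
    # Normalise so that dot products are cosine similarities.
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ s.T                               # (n_query, n_support)
    idx = np.argsort(-sims, axis=1)[:, :k]       # k nearest support frames
    return support_va[idx].mean(axis=1)          # (n_query, 2)
```

With a support set a few percent the size of the video, every remaining frame is labelled by one such lookup in embedding space, which is what makes the scheme cheap compared to frame-by-frame manual annotation.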
Index Terms
- Efficient Labelling of Affective Video Datasets via Few-Shot & Multi-Task Contrastive Learning