Abstract
Design strategies of model architecture greatly affect the performance of multimodal classification tasks. In traditional models, neural network architectures are designed manually based on human understanding of specific tasks, which limits their generalization capability. This paper explores the optimal architecture for multimodal fusion using neural architecture search (NAS). NAS relies on a controller to generate better architectures and to predict the accuracy of given architectures; however, evaluating architectures with the controller is very time-consuming. We discuss a semi-supervised strategy for architecture evaluation that reduces the search time complexity, but it degrades the performance of the predictor. We therefore present relational-graph-predictive NAS (RGNAS), which compensates for the insufficiency of labeled architectures and improves the accuracy of the predictor. RGNAS leverages the intrinsic relationship between labeled architectures and abundant unlabeled architectures, achieving a reasonable trade-off between accuracy and search time complexity. We validate the effectiveness of the proposed method on several multimodal datasets (eNTERFACE05, AFEW9.0, and MM-IMDb). Extensive experiments demonstrate that our method outperforms the state of the art and achieves better robustness and generalization.
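The core idea the abstract describes, i.e., using the relationship between a few labeled (evaluated) architectures and many unlabeled ones to improve an accuracy predictor, can be illustrated with a generic graph-based label-propagation sketch. This is not the paper's RGNAS method; the architecture encodings, similarity kernel, and propagation scheme below are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's method): each candidate
# architecture is encoded as a feature vector; a similarity (relation)
# graph propagates measured accuracies from a few evaluated
# architectures to many unevaluated ones.
rng = np.random.default_rng(0)

n_labeled, n_unlabeled, dim = 5, 20, 8
X = rng.normal(size=(n_labeled + n_unlabeled, dim))    # architecture encodings
y = np.full(n_labeled + n_unlabeled, np.nan)
y[:n_labeled] = rng.uniform(0.6, 0.9, size=n_labeled)  # measured accuracies

# Gaussian similarity between encodings, row-normalized into a
# transition matrix over the relation graph.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)

# Iterative label propagation: each node takes the similarity-weighted
# average of its neighbors; labeled nodes are clamped to their
# measured accuracies after every step.
f = np.where(np.isnan(y), y[:n_labeled].mean(), y)
for _ in range(50):
    f = P @ f
    f[:n_labeled] = y[:n_labeled]

predicted = f[n_labeled:]  # predicted accuracies for unlabeled architectures
```

Because each propagation step is a convex combination of neighbor values with labeled nodes clamped, the predicted accuracies stay within the range of the measured ones, which is the intuition behind letting cheap unlabeled architectures stabilize a predictor trained on few expensive evaluations.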
Data availability statement
The eNTERFACE05 dataset included in the manuscript can be found at: http://www.enterface.net/results/. The AFEW9.0 dataset included in the manuscript can be found at: https://sites.google.com/site/emotiwchallenge/. The MM-IMDb dataset included in the manuscript can be found at: http://lisi1.unal.edu.co/mmimdb/. All other data are available from the authors upon reasonable request.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (B200202205) and the National Natural Science Foundation of China under grants 61501170, 41876097, and 61872199.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. The model and framework were designed by Xiao Yao. The analysis and code were performed by Fang Li. The first draft of the manuscript was written by Fang Li, and Yifeng Zeng collected the data and prepared the materials. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Yao, X., Li, F. & Zeng, Y. Relational structure predictive neural architecture search for multimodal fusion. Soft Comput 26, 2807–2818 (2022). https://doi.org/10.1007/s00500-022-06772-y