
Relational structure predictive neural architecture search for multimodal fusion

  • Data analytics and machine learning
Soft Computing

Abstract

Model architecture design strategies greatly affect performance on multimodal classification tasks. In traditional models, neural network architectures are designed manually, relying on human understanding of the specific task, so their generalization capability is limited. This paper explores the search for an optimal multimodal fusion architecture using neural architecture search (NAS). NAS relies on a controller to generate better architectures and to predict the accuracy of candidate architectures; however, evaluating candidate architectures with the controller is very time-consuming. We adopt a semi-supervised strategy for architecture evaluation to reduce the search time complexity, but this degrades the accuracy of the predictor. We therefore present relational-graph-predictive NAS (RGNAS), which leverages the intrinsic relationships between the few labeled architectures and the abundant unlabeled ones to compensate for the scarcity of labeled architectures and improve the predictor's accuracy, achieving a reasonable trade-off between accuracy and search time complexity. We validate the effectiveness of the proposed method on several multimodal datasets (eNTERFACE05, AFEW9.0, and MM-IMDb). Extensive experiments demonstrate that our method outperforms the state of the art and achieves better robustness and generalization.
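The semi-supervised predictor idea described above can be pictured with a minimal sketch, assuming each candidate architecture has already been encoded as a fixed-length vector: a small regression head is fitted to the few architectures whose accuracy has been measured, while a Gaussian-kernel affinity graph over all encodings adds a smoothness penalty that pushes similar architectures toward similar predicted accuracies. The class and function names, encoding dimension, kernel, and loss weight below are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the RGNAS code): a relation-graph-regularised accuracy predictor
# trained on a few labeled architectures plus many unlabeled ones.
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Maps a fixed-length architecture encoding to a predicted accuracy in [0, 1]."""
    def __init__(self, enc_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def relation_graph(encodings: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian-kernel affinity matrix over architecture encodings."""
    d2 = torch.cdist(encodings, encodings).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def semi_supervised_loss(pred_lab, acc_lab, pred_all, affinity, lam=0.1):
    """Supervised regression loss plus a graph-smoothness penalty:
    architectures with similar encodings should get similar predictions."""
    sup = nn.functional.mse_loss(pred_lab, acc_lab)
    diff = pred_all.unsqueeze(0) - pred_all.unsqueeze(1)
    smooth = (affinity * diff.pow(2)).mean()
    return sup + lam * smooth

if __name__ == "__main__":
    torch.manual_seed(0)
    enc_dim = 32
    x_lab, y_lab = torch.randn(20, enc_dim), torch.rand(20)   # labeled architectures
    x_unlab = torch.randn(200, enc_dim)                       # unlabeled architectures
    x_all = torch.cat([x_lab, x_unlab])
    A = relation_graph(x_all)                                 # fixed relation graph

    model = AccuracyPredictor(enc_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        pred_all = model(x_all)
        loss = semi_supervised_loss(pred_all[:20], y_lab, pred_all, A)
        loss.backward()
        opt.step()
```

The sketch isolates only the semi-supervised regularisation idea stated in the abstract; it omits the search loop, the architecture encoder, and how RGNAS actually constructs and uses the relational graph during search.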


Data availability statement

The eNTERFACE05 dataset used in the manuscript is available at http://www.enterface.net/results/. The AFEW9.0 dataset is available at https://sites.google.com/site/emotiwchallenge/. The MM-IMDb dataset is available at http://lisi1.unal.edu.co/mmimdb/. All other data are available from the authors upon reasonable request.


Funding

This work is supported by the Fundamental Research Funds for the Central Universities (B200202205) and the National Natural Science Foundation of China under grants 61501170, 41876097, and 61872199.

Author information


Contributions

All authors contributed to the study conception and design. The model and framework were designed by Xiao Yao. The analysis and code were performed by Fang Li. The first draft of the manuscript was written by Fang Li, and Yifeng Zeng collected the data and prepared materials. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fang Li.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yao, X., Li, F. & Zeng, Y. Relational structure predictive neural architecture search for multimodal fusion. Soft Comput 26, 2807–2818 (2022). https://doi.org/10.1007/s00500-022-06772-y

