Abstract
Design strategies of model architecture greatly affect the performance of multimodal classification tasks. In traditional models, neural network architectures are designed manually based on human understanding of specific tasks, which limits their generalization capability. This paper explores the optimal architecture for multimodal fusion using neural architecture search (NAS). NAS relies on a controller to generate better architectures and to predict the accuracy of given architectures; however, evaluating architectures with the controller is very time-consuming. We discuss a semi-supervised strategy for architecture evaluation that reduces the search time complexity, but it degrades the performance of the predictor. We therefore present relational-graph-predictive NAS (RGNAS), which compensates for the insufficiency of labeled architectures and improves the accuracy of the predictor. RGNAS leverages the intrinsic relationship between labeled architectures and abundant unlabeled architectures, achieving a reasonable trade-off between accuracy and search time complexity. We validate the effectiveness of the proposed method on several multimodal datasets (eNTERFACE05, AFEW9.0, and MM-IMDb). Extensive experiments demonstrate that our method outperforms the state of the art and achieves better robustness and generalization.
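The core idea the abstract describes, i.e., using the relationship between a few labeled (evaluated) architectures and many unlabeled ones to improve an accuracy predictor, can be illustrated with a generic graph-based label-propagation sketch. This is not the paper's RGNAS method; the architecture encodings, similarity kernel, and propagation scheme below are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's method): each candidate
# architecture is encoded as a feature vector; a similarity (relation)
# graph propagates measured accuracies from a few evaluated
# architectures to many unevaluated ones.
rng = np.random.default_rng(0)

n_labeled, n_unlabeled, dim = 5, 20, 8
X = rng.normal(size=(n_labeled + n_unlabeled, dim))    # architecture encodings
y = np.full(n_labeled + n_unlabeled, np.nan)
y[:n_labeled] = rng.uniform(0.6, 0.9, size=n_labeled)  # measured accuracies

# Gaussian similarity between encodings, row-normalized into a
# transition matrix over the relation graph.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)

# Iterative label propagation: each node takes the similarity-weighted
# average of its neighbors; labeled nodes are clamped to their
# measured accuracies after every step.
f = np.where(np.isnan(y), y[:n_labeled].mean(), y)
for _ in range(50):
    f = P @ f
    f[:n_labeled] = y[:n_labeled]

predicted = f[n_labeled:]  # predicted accuracies for unlabeled architectures
```

Because each propagation step is a convex combination of neighbor values with labeled nodes clamped, the predicted accuracies stay within the range of the measured ones, which is the intuition behind letting cheap unlabeled architectures stabilize a predictor trained on few expensive evaluations.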
Data availability statement
The eNTERFACE05 dataset included in the manuscript can be found at: http://www.enterface.net/results/. The AFEW9.0 dataset included in the manuscript can be found at: https://sites.google.com/site/emotiwchallenge/. The MM-IMDb dataset included in the manuscript can be found at: http://lisi1.unal.edu.co/mmimdb/. All other data are available from the authors upon reasonable request.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (B200202205) and the National Natural Science Foundation of China under grants 61501170, 41876097, and 61872199.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. The model and framework were designed by Xiao Yao. The analysis and code were performed by Fang Li. The first draft of the manuscript was written by Fang Li, and Yifeng Zeng collected the data and prepared the materials. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Yao, X., Li, F. & Zeng, Y. Relational structure predictive neural architecture search for multimodal fusion. Soft Comput 26, 2807–2818 (2022). https://doi.org/10.1007/s00500-022-06772-y