Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?

Gao, Yan; Xu, Tong; Chen, Enhong

doi:10.1007/978-3-031-57808-3_12

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 703)

Included in the following conference series:

International Conference on Intelligent Information Processing

Abstract

Imperfect multi-modal data with missing modalities commonly appear in realistic application scenarios, which breaks the data-completeness assumption of multi-modal analysis. The multi-modal learning community has therefore devoted considerable effort to robust solutions for modality-missing data. Recently, pre-trained models based on Mixture-of-Modality-Experts (MoME) Transformers have been proposed; they achieve competitive performance on various downstream tasks by using different feed-forward network experts for single-modal and multi-modal inputs. A natural question arises: are Mixture-of-Modality-Experts Transformers robust to missing modality? In this paper, we investigate the MoME Transformer in depth under the missing-modality problem. Specifically, we propose a novel multi-task learning strategy that uses a single unified model to handle missing modalities during both training and inference, which makes the MoME Transformer robust to missing modality. To validate the effectiveness of the proposed method, we conduct extensive experiments on three popular datasets; the results indicate that our method can outperform state-of-the-art (SOTA) methods by a large margin.
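
To make the idea of "different feed-forward network experts for single-modal and multi-modal inputs" concrete, the following is a minimal sketch of expert selection, not the authors' implementation. The class name `MoMEFeedForward`, the expert names, the dimensions, and the random weights are all illustrative assumptions, and the routing is simplified to one expert per input type; the multi-task training strategy proposed in the paper is not shown.

```python
import random


def make_ffn(dim, hidden, seed):
    """Build a tiny two-layer feed-forward network with random weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        # Hidden layer with ReLU, then a linear projection back to `dim`.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return ffn


class MoMEFeedForward:
    """Hypothetical MoME-style layer: one feed-forward expert per input type."""

    def __init__(self, dim=4, hidden=8):
        # Expert names are assumptions made for this sketch.
        self.experts = {
            "vision": make_ffn(dim, hidden, seed=0),
            "language": make_ffn(dim, hidden, seed=1),
            "vision_language": make_ffn(dim, hidden, seed=2),
        }

    def select_expert(self, has_image, has_text):
        """Pick the expert from the modalities actually present."""
        if has_image and has_text:
            return "vision_language"
        if has_image:
            return "vision"
        if has_text:
            return "language"
        raise ValueError("at least one modality must be present")

    def forward(self, tokens, has_image, has_text):
        expert = self.experts[self.select_expert(has_image, has_text)]
        return [expert(token) for token in tokens]


# Usage: a text-only sample (image missing) is routed to the language expert.
layer = MoMEFeedForward()
text_only = layer.forward([[1.0, 0.0, 0.0, 0.0]], has_image=False, has_text=True)
```

Under these assumptions, a sample with a missing modality still passes through the same model; only the expert chosen for it changes.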

Author information

Authors and Affiliations

University of Science and Technology of China, Hefei, 230027, China
Yan Gao, Tong Xu & Enhong Chen

Corresponding author

Correspondence to Yan Gao.

Editor information

Editors and Affiliations

Chinese Academy of Sciences, Beijing, China
Zhongzhi Shi
University of Oslo, Oslo, Norway
Jim Torresen
De Montfort University, Leicester, UK
Shengxiang Yang

About this paper

Cite this paper

Gao, Y., Xu, T., Chen, E. (2024). Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?. In: Shi, Z., Torresen, J., Yang, S. (eds) Intelligent Information Processing XII. IIP 2024. IFIP Advances in Information and Communication Technology, vol 703. Springer, Cham. https://doi.org/10.1007/978-3-031-57808-3_12

DOI: https://doi.org/10.1007/978-3-031-57808-3_12
Published: 06 April 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57807-6
Online ISBN: 978-3-031-57808-3
eBook Packages: Computer Science, Computer Science (R0)
