
Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?

  • Conference paper
Intelligent Information Processing XII (IIP 2024)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 703)

Abstract

Imperfect multi-modal data with missing modalities commonly appears in realistic application scenarios, breaking the data-completeness assumption that most multi-modal analysis relies on. Consequently, the multi-modal learning community has devoted substantial effort to robust solutions for modality-missing data. Recently, pre-trained models based on Mixture-of-Modality-Experts (MoME) Transformers have been proposed; by employing different feed-forward-network experts for single-modal and multi-modal inputs, they achieve competitive performance on various downstream tasks. A natural question arises: are Mixture-of-Modality-Experts Transformers robust to missing modality? To answer it, this paper conducts a deep investigation of the MoME Transformer under the missing-modality problem. Specifically, we propose a novel multi-task learning strategy that uses a single, uniform model to handle missing modalities during both training and inference, thereby empowering the MoME Transformer with robustness to missing modality. To validate the effectiveness of the proposed method, we conduct extensive experiments on three popular datasets, which show that our method outperforms state-of-the-art (SOTA) methods by a large margin.
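
The MoME design investigated here (popularized by VLMo) shares multi-head self-attention across modalities and routes tokens to modality-specific feed-forward experts, which is what lets a single model accept image-only, text-only, or image-text inputs. The following PyTorch sketch illustrates one such block under that assumption; the module name MoMEBlock, the modality routing argument, and all hyper-parameters are illustrative and are not the authors' implementation.

```python
# A minimal sketch of a Mixture-of-Modality-Experts (MoME) Transformer block,
# assuming a VLMo-style design: shared multi-head self-attention followed by
# modality-specific feed-forward experts (vision, language, vision-language).
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per input type; all experts share the attention.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Shared self-attention over all tokens, regardless of modality.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route tokens to the expert matching the (possibly missing) modality.
        x = x + self.experts[modality](self.norm2(x))
        return x


# Example: an image-only input (text missing) is routed to the vision expert.
block = MoMEBlock()
image_tokens = torch.randn(2, 197, 768)   # (batch, tokens, dim)
out = block(image_tokens, modality="vision")
```

Because the experts are selected by input type rather than learned gating, a missing modality simply changes which expert is exercised, which is the property the paper's multi-task training strategy exploits.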

Author information

Corresponding author

Correspondence to Yan Gao.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Gao, Y., Xu, T., Chen, E. (2024). Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?. In: Shi, Z., Torresen, J., Yang, S. (eds) Intelligent Information Processing XII. IIP 2024. IFIP Advances in Information and Communication Technology, vol 703. Springer, Cham. https://doi.org/10.1007/978-3-031-57808-3_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-57808-3_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57807-6

  • Online ISBN: 978-3-031-57808-3

  • eBook Packages: Computer Science, Computer Science (R0)
