Abstract
It is commonly seen that the imperfect multi-modal data with missing modality appears in realistic application scenarios, which usually break the data completeness assumption of multi-modal analysis. Therefore, large efforts in multi-modal learning communities have been made on the robust solution for modality-missing data. Recently, pre-trained models based on Mixture-of-Modality-Experts (MoME) Transformers have been proposed, which achieved competitive performance in various downstream tasks, by utilizing different experts of feed-forward networks for single/multi modal inputs. One natural question arises: are Mixture-of-Modality-Experts Transformers robust to missing modality? To that end, in this paper, we conduct a deep investigation on MoME Transformer under the missing modality problem. Specifically, we propose a novel multi-task learning strategy, which leverages a uniform model to handle missing modalities during training and inference. In this way, the MoME Transformer will be empowered with robustness to missing modality. To validate the effectiveness of our proposed method, we conduct extensive experiments on three popular datasets, which indicate our method could outperform the state-of-the-art (SOTA) methods with a large margin.
References
Arevalo, J., Solorio, T., Montes-y Gómez, M., González, F.A.: Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992 (2017)
Bao, H., et al.: VLMO: unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural. Inf. Process. Syst. 35, 32897–32912 (2022)
Biten, A.F., Litman, R., Xie, Y., Appalaraju, S., Manmatha, R.: LaTr: layout-aware transformer for scene-text VQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16548–16558 (2022)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Gallo, I., Ria, G., Landro, N., Grassa, R.L.: Image and text fusion for UPMC food-101 using BERT and CNNs. In: International Conference on Image and Vision Computing New Zealand (IVCNZ 2020), pp. 1–6 (2020)
Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., Testuggine, D.: The hateful memes challenge: Detecting hate speech in multimodal memes. Adv. Neural. Inf. Process. Syst. 33, 2611–2624 (2020)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)
Lee, Y.L., Tsai, Y.H., Chiu, W.C., Lee, C.Y.: Multimodal prompting with missing modalities for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14943–14952 (2023)
Li, G., Zhu, L., Liu, P., Yang, Y.: Entangled transformer for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8928–8937 (2019)
Li, L., Chen, Y.C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: Hero: hierarchical encoder for video+ language omni-representation pre-training. In: EMNLP (2020)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Luo, Z., Hsieh, J.-T., Jiang, L., Niebles, J.C., Fei-Fei, L.: Graph distillation for action detection with privileged modalities. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 174–192. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_11
Ma, M., Ren, J., Zhao, L., Testuggine, D., Peng, X.: Are multimodal transformers robust to missing modality? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18177–18186 (2022)
Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., Peng, X.: SMIL: multimodal learning with severely missing modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2302–2310 (2021)
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)
Sun, L., Xia, C., Yin, W., Liang, T., Philip, S.Y., He, L.: Mixup-transformer: dynamic data augmentation for NLP tasks. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 3436–3440 (2020)
Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100–5111 (2019)
Vermaa, V., et al.: Interpolation consistency training for semi-supervised learning. Neural Netw. 145, 90–106 (2022)
Wang, W., et al.: Image as a foreign language: Beit pretraining for vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19175–19186 (2023)
Yao, S., Wan, X.: Multimodal transformer for multimodal machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4346–4350 (2020)
Yu, J., Li, J., Yu, Z., Huang, Q.: Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 30(12), 4467–4480 (2020)
Yuan, Z., Li, W., Xu, H., Yu, W.: Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In: Proceedings of the 29th ACM International Conference on Multimedia. MM 2021, pp. 4400–4407, New York, NY, USA. Association for Computing Machinery (2021)
Zeng, J., Liu, T., Zhou, J.: Tag-assisted multimodal sentiment analysis under uncertain missing modalities. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1545–1554 (2022)
Zhang, B., Fang, Y., Ren, T., Wu, G.: Multimodal analysis for deep video understanding with video language transformer. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 7165–7169 (2022)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Gao, Y., Xu, T., Chen, E. (2024). Are Mixture-of-Modality-Experts Transformers Robust to Missing Modality During Training and Inferring?. In: Shi, Z., Torresen, J., Yang, S. (eds) Intelligent Information Processing XII. IIP 2024. IFIP Advances in Information and Communication Technology, vol 703. Springer, Cham. https://doi.org/10.1007/978-3-031-57808-3_12
Download citation
DOI: https://doi.org/10.1007/978-3-031-57808-3_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57807-6
Online ISBN: 978-3-031-57808-3
eBook Packages: Computer ScienceComputer Science (R0)