Abstract
Multi-modal networks are usually challenging to train because of their complexity. On the one hand, they are prone to underfitting because the data formats of the different modalities are heterogeneous. On the other hand, data from different domains follow different distributions, and these domain differences can be difficult to eliminate in joint training. This paper presents a Multi-Stage Multi-Modal pre-training strategy (MSMM) to learn a multi-modal joint representation effectively. To ease the difficulty of multi-modal end-to-end training, MSMM first pre-trains each uni-modal network separately and then trains the multi-modal network jointly. After this multi-stage pre-training, we obtain both a better multi-modal joint representation and better uni-modal representations. We also design a multi-modal network and a multi-task loss so that the whole network can be trained in an end-to-end fashion. Extensive empirical results show that MSMM significantly improves the multi-modal model's performance on the video classification task.
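To make the two-stage recipe concrete, below is a minimal PyTorch sketch of the idea the abstract describes: pre-train each uni-modal encoder separately (stage 1), then train the fused multi-modal network end-to-end with a multi-task loss (stage 2). The encoder architectures, the concatenation-based fusion, the per-modality auxiliary heads, and the loss weight `aux_weight` are all illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of multi-stage multi-modal pre-training.
# Stage 1: each uni-modal encoder is trained alone on its own labels.
# Stage 2: encoders are fused and the whole network is trained end-to-end
#          with a multi-task loss (fused loss + weighted uni-modal losses).
import torch
import torch.nn as nn


class UniModalEncoder(nn.Module):
    """Toy encoder standing in for a real uni-modal backbone (e.g. BERT, CNN)."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim)
        )

    def forward(self, x):
        return self.net(x)


class MultiModalNet(nn.Module):
    """Fuses per-modality features and exposes one head per task."""

    def __init__(self, encoders: dict, hid_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One auxiliary classification head per modality plus one fused head:
        # together these implement the multi-task loss.
        self.uni_heads = nn.ModuleDict(
            {m: nn.Linear(hid_dim, num_classes) for m in encoders}
        )
        self.fused_head = nn.Linear(hid_dim * len(encoders), num_classes)

    def forward(self, inputs: dict):
        feats = {m: enc(inputs[m]) for m, enc in self.encoders.items()}
        fused = torch.cat(list(feats.values()), dim=-1)  # simple concat fusion
        return self.fused_head(fused), {m: self.uni_heads[m](f) for m, f in feats.items()}


def pretrain_unimodal(encoder, head, loader, epochs: int = 1):
    """Stage 1: train one encoder in isolation on (x, y) batches."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = ce(head(encoder(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def joint_train_step(model, inputs, labels, opt, aux_weight: float = 0.3):
    """Stage 2: one end-to-end step; fused loss plus weighted uni-modal losses."""
    ce = nn.CrossEntropyLoss()
    fused_logits, uni_logits = model(inputs)
    loss = ce(fused_logits, labels)
    loss = loss + aux_weight * sum(ce(l, labels) for l in uni_logits.values())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch, stage 1 gives each encoder a sensible initialization before fusion, which is the mechanism the abstract credits for avoiding the underfitting and domain-gap problems of training everything jointly from scratch; the auxiliary uni-modal losses in stage 2 keep each branch from degenerating during joint training.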