Abstract
Multi-modal networks are usually challenging to train because of their complexity. On the one hand, they are prone to underfitting because the data formats of the different modalities are heterogeneous. On the other hand, data from different domains follow different distributions, and these domain differences can be difficult to eliminate in joint training. This paper presents a Multi-Stage Multi-Modal pre-training strategy (MSMM) to learn a multi-modal joint representation effectively. To ease the difficulty of multi-modal end-to-end training, MSMM first pre-trains each uni-modal network separately and then trains the multi-modal network jointly. After this multi-stage pre-training, we obtain both a better multi-modal joint representation and better uni-modal representations. We also design a multi-modal network and a multi-task loss so that the whole network can be trained in an end-to-end fashion. Extensive empirical results show that MSMM significantly improves the multi-modal model's performance on the video classification task.
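To make the two-stage recipe concrete, below is a minimal PyTorch sketch of the idea the abstract describes: pre-train each uni-modal encoder separately (stage 1), then train the fused multi-modal network end-to-end with a multi-task loss (stage 2). The encoder architectures, the concatenation-based fusion, the per-modality auxiliary heads, and the loss weight `aux_weight` are all illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of multi-stage multi-modal pre-training.
# Stage 1: each uni-modal encoder is trained alone on its own labels.
# Stage 2: encoders are fused and the whole network is trained end-to-end
#          with a multi-task loss (fused loss + weighted uni-modal losses).
import torch
import torch.nn as nn


class UniModalEncoder(nn.Module):
    """Toy encoder standing in for a real uni-modal backbone (e.g. BERT, CNN)."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim)
        )

    def forward(self, x):
        return self.net(x)


class MultiModalNet(nn.Module):
    """Fuses per-modality features and exposes one head per task."""

    def __init__(self, encoders: dict, hid_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One auxiliary classification head per modality plus one fused head:
        # together these implement the multi-task loss.
        self.uni_heads = nn.ModuleDict(
            {m: nn.Linear(hid_dim, num_classes) for m in encoders}
        )
        self.fused_head = nn.Linear(hid_dim * len(encoders), num_classes)

    def forward(self, inputs: dict):
        feats = {m: enc(inputs[m]) for m, enc in self.encoders.items()}
        fused = torch.cat(list(feats.values()), dim=-1)  # simple concat fusion
        return self.fused_head(fused), {m: self.uni_heads[m](f) for m, f in feats.items()}


def pretrain_unimodal(encoder, head, loader, epochs: int = 1):
    """Stage 1: train one encoder in isolation on (x, y) batches."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = ce(head(encoder(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def joint_train_step(model, inputs, labels, opt, aux_weight: float = 0.3):
    """Stage 2: one end-to-end step; fused loss plus weighted uni-modal losses."""
    ce = nn.CrossEntropyLoss()
    fused_logits, uni_logits = model(inputs)
    loss = ce(fused_logits, labels)
    loss = loss + aux_weight * sum(ce(l, labels) for l in uni_logits.values())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch, stage 1 gives each encoder a sensible initialization before fusion, which is the mechanism the abstract credits for avoiding the underfitting and domain-gap problems of training everything jointly from scratch; the auxiliary uni-modal losses in stage 2 keep each branch from degenerating during joint training.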