DOI: 10.1145/3447548.3467206

M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining

Published: 14 August 2021

Abstract

Multimodal pretraining has demonstrated success in downstream tasks of cross-modal representation learning. However, it has been largely limited to English data, and there is still a lack of large-scale datasets for multimodal pretraining in Chinese. In this work, we propose the largest pretraining dataset in Chinese, which consists of over 1.9 TB of images and 292 GB of text. The dataset covers a wide range of domains, including encyclopedia articles, question answering, and forum discussions. In addition, we propose a method called M6, short for Multi-Modality-to-Multi-Modality Multitask Mega-transformer, for unified pretraining on both single-modality and multi-modality data. The model is pretrained with our proposed tasks, including text-to-text transfer, image-to-text transfer, and multi-modality-to-text transfer. These tasks endow the model with strong capabilities in both understanding and generation. We scale the model to 10 billion parameters and build the largest pretrained model in Chinese. Experimental results show that M6 outperforms the baselines on a number of downstream tasks involving both single and multiple modalities, and the 10B-parameter pretrained model demonstrates strong potential in the zero-shot setting.
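
The abstract describes a single transformer that is pretrained on plain text and on image-text pairs with generation-style objectives (text-to-text, image-to-text, and multi-modality-to-text transfer). Below is a minimal PyTorch sketch of that kind of setup, assuming precomputed 2048-dimensional image-region features, a toy vocabulary, and a plain token-prediction loss; all module names and hyperparameters are illustrative assumptions, not the authors' implementation.

# Minimal sketch (PyTorch) of unified multimodal-to-text pretraining, assuming:
#  - image regions arrive as precomputed 2048-d features (e.g. from a detector),
#  - text is already tokenized into integer ids,
#  - a single transformer attends over [image tokens; text tokens] and is
#    trained to predict the text tokens (multi-modality-to-text transfer).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedMultimodalTransformer(nn.Module):
    def __init__(self, vocab_size=21128, d_model=512, n_heads=8,
                 n_layers=6, n_regions=49, max_text_len=512, feat_dim=2048):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(feat_dim, d_model)      # image features -> model dim
        self.pos_emb = nn.Embedding(n_regions + max_text_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, text_ids):
        # region_feats: (B, n_regions, feat_dim); text_ids: (B, T)
        img = self.region_proj(region_feats)
        txt = self.text_emb(text_ids)
        x = torch.cat([img, txt], dim=1)                     # one joint sequence
        pos = torch.arange(x.size(1), device=x.device)
        h = self.encoder(x + self.pos_emb(pos))
        return self.lm_head(h[:, img.size(1):])              # logits over text positions only


# Toy pretraining step: predict text tokens conditioned on the image.
model = UnifiedMultimodalTransformer()
regions = torch.randn(2, 49, 2048)                           # fake image-region features
tokens = torch.randint(0, 21128, (2, 16))                    # fake token ids
logits = model(regions, tokens)                              # (2, 16, vocab)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
loss.backward()

The sketch only illustrates how image and text inputs can share a single transformer sequence; the actual 10B-parameter model and its exact pretraining objectives are described in the paper.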





      Information

      Published In

      KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
      August 2021
      4259 pages
      ISBN:9781450383325
      DOI:10.1145/3447548
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 August 2021


      Author Tags

      1. cross-modal understanding and generation
      2. large-scale pretraining
      3. multi-modal pretraining

      Qualifiers

      • Research-article

      Conference

      KDD '21

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 87
      • Downloads (Last 6 weeks): 9
      Reflects downloads up to 20 Feb 2025


      Citations

      Cited By

      • (2025) How Can Recommender Systems Benefit from Large Language Models: A Survey. ACM Transactions on Information Systems 43(2), 1-47. DOI: 10.1145/3678004. Online publication date: 18-Jan-2025.
      • (2024) Unified Pretraining for Recommendation via Task Hypergraphs. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 891-900. DOI: 10.1145/3616855.3635811. Online publication date: 4-Mar-2024.
      • (2024) Temporally Language Grounding With Multi-Modal Multi-Prompt Tuning. IEEE Transactions on Multimedia 26, 3366-3377. DOI: 10.1109/TMM.2023.3310282. Online publication date: 2024.
      • (2023) A Survey of Full-Cycle Cross-Modal Retrieval: From a Representation Learning Perspective. Applied Sciences 13(7), 4571. DOI: 10.3390/app13074571. Online publication date: 4-Apr-2023.
      • (2023) Cross-modal representation learning and generation. Journal of Image and Graphics 28(6), 1608-1629. DOI: 10.11834/jig.230035. Online publication date: 2023.
      • (2023) Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications. ACM Transactions on Multimedia Computing, Communications, and Applications 20(3), 1-34. DOI: 10.1145/3617833. Online publication date: 23-Oct-2023.
      • (2023) MGeo: Multi-Modal Geographic Language Model Pre-Training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 185-194. DOI: 10.1145/3539618.3591728. Online publication date: 19-Jul-2023.
      • (2023) Multimodal Pre-Training with Self-Distillation for Product Understanding in E-Commerce. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 1039-1047. DOI: 10.1145/3539597.3570423. Online publication date: 27-Feb-2023.
      • (2023) Fast and Accurate FSA System Using ELBERT: An Efficient and Lightweight BERT. IEEE Transactions on Signal Processing 71, 3821-3834. DOI: 10.1109/TSP.2023.3322825. Online publication date: 1-Jan-2023.
      • (2023) Multimodal Learning With Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12113-12132. DOI: 10.1109/TPAMI.2023.3275156. Online publication date: 1-Oct-2023.
