Neurocomputing, Volume 515, 1 January 2023, Pages 89-106

A survey of transformer-based multimodal pre-trained models

https://doi.org/10.1016/j.neucom.2022.09.136

Highlights

  • Multimodal pre-trained models with document layout, vision-text, and audio-text domains as input.

  • Collection of common multimodal downstream applications with related datasets.

  • Modality feature embedding strategies.

  • Cross-modality alignment pre-training tasks for different multimodal domains.

  • Variations of the audio-text cross-modal learning architecture.

Abstract

With the broad industrialization of Artificial Intelligence (AI), we observe that a large fraction of real-world AI applications are multimodal in nature, in terms of both the relevant data and the ways of interaction. Large pre-trained models have proven to be the most effective framework for jointly modeling multi-modality data. This paper provides a thorough account of the opportunities and challenges of Transformer-based multimodal pre-trained models (PTMs) in various domains. We begin by reviewing the representative tasks of multimodal AI applications, ranging from vision-text and audio-text fusion to more complex tasks such as document layout understanding. We particularly address document layout understanding as a new multimodal research domain. We further analyze and compare state-of-the-art Transformer-based multimodal PTMs from multiple aspects, including downstream applications, datasets, input feature embeddings, and model architectures. In conclusion, we summarize the key challenges of this field and suggest several future research directions.

Introduction

The way in which something occurs or is experienced is referred to as its modality; common modalities include audio, video, image, and text [1]. Multimodal data describing the same or related objects has grown exponentially in recent years, and multimodality has consequently emerged as a primary form of information resource [2]. Human cognitive processes are multimodal in nature: the things we see, hear, and say are combined for processing and understanding, so multimodal machine learning approaches are closer to how humans perceive their surroundings. Models must therefore be created to comprehend and manage such multimodal data. Because of the heterogeneity of the data, multimodal machine learning presents some unique challenges. For example, representation learning and feature extraction methods are relatively well developed in the traditional unimodal domain [3], [4], [5], [6], whereas in the multimodal domain there is a desire to employ few-shot or zero-shot learning due to the scarcity of high-quality labeled multimodal data.

The Transformer [7], initially proposed as a sequence-to-sequence model for machine translation, has demonstrated significant performance improvements and gained prominence in deep learning. BERT (Bidirectional Encoder Representations from Transformers) [8] can be regarded as a key milestone in natural language processing (NLP): it utilizes a Transformer network for pre-training on an unlabeled text corpus and achieves state-of-the-art (SOTA) performance on 11 downstream tasks. Later research demonstrates that Transformer-based PTMs can achieve breakthrough performance on a variety of NLP tasks [9]. As a result, the Transformer has established itself as the dominant architecture for NLP, particularly in the context of PTMs. The standard pre-training and fine-tuning methodology is to train the model on a large amount of training data and then fine-tune it on smaller task-specific datasets for downstream tasks; the pre-training step allows the model to acquire generic representations that are helpful for downstream tasks, as sketched below.
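To make the pre-training/fine-tuning recipe concrete, the following minimal sketch (our illustration, not code from any surveyed work) reuses a publicly released pre-trained BERT encoder and fine-tunes it on a toy two-example sentiment task with the Hugging Face transformers library; the texts, labels, and hyperparameters are placeholder assumptions.

```python
# Minimal pre-train/fine-tune sketch: a BERT encoder pre-trained on unlabeled
# text is reused and fine-tuned on a small labeled downstream dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # generic representation + new task head
)

# Toy task-specific data; in practice this is the small labeled downstream set.
texts = ["the movie was great", "the movie was terrible"]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few fine-tuning epochs
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    out = model(**batch, labels=labels)   # cross-entropy loss on the task head
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```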

The breakthrough of Transformer-based PTMs in NLP has inspired academic interest in the convergence of several modalities, such as video and text or image and text [10]. Multimodal PTMs based on Transformer structure can learn semantic correspondence between different modalities by pre-training on large amounts of unlabeled data and then fine-tuning on small amounts of labeled data [11]. Depending on the modalities employed, the majority of these cross-modal works can be further classified as image-text-based tasks, video-text-based tasks, or audio-text-based tasks. For instance, in image-text-based tasks, the model is intended to associate the word ”cat” in the caption text with the ”cat” in the image. To accomplish this goal, a carefully designed PTM is required, allowing the model to investigate the relationship between different modalities.
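As one illustration of how such cross-modal correspondence can be learned, the sketch below implements a CLIP-style contrastive alignment loss over paired image and caption embeddings; the encoders producing `image_emb` and `text_emb` are assumed to exist and do not correspond to any specific PTM covered in this survey, and the batch size, embedding dimension, and temperature are illustrative choices.

```python
# A CLIP-style contrastive objective: matched image/caption pairs are pulled
# together and mismatched pairs pushed apart, so that e.g. a "cat" caption
# ends up aligned with cat images.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(image_emb.size(0))         # i-th image <-> i-th caption
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with a dummy batch of 8 paired 512-dimensional embeddings
# (in practice these come from an image encoder and a text encoder).
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = contrastive_alignment_loss(img, txt)
```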

In this survey, we aim to give a comprehensive review of existing Transformer-based multimodal PTMs. As shown in Fig. 1, we categorize existing multimodal PTMs into document layout, vision-text-based, and audio-text-based domains, and compare them in terms of applications and downstream tasks, input feature embeddings, pre-training tasks, and model architectures. It is worth mentioning that document layout, also known as Document AI or Document Intelligence, is a relatively recent research field that integrates layout spatial information, image information, and text information [12]. To the best of our knowledge, this survey is the first to introduce the related PTM research progress in this multimodal domain.

The rest of this paper is organized as follows. Section 2 begins with a comparison with previous surveys and highlights the unique contributions of our work. Section 3 gives a comprehensive comparison of prevalent multimodal tasks and their corresponding datasets. Sections 4, 5, and 6 summarize the primary existing methods in the document layout, vision-text, and audio-text domains, respectively, in terms of downstream applications, input feature embeddings, pre-training tasks, and model architectures. Section 7 discusses the challenges and future work, and Section 8 concludes this paper.

Section snippets

Comparison with previous surveys

In the past two years, a number of survey papers on multimodal learning or pre-trained models (as listed in Table 1) have provided good overviews of the progress of this sub-field. In this part, we compare our work to earlier surveys to emphasize its distinct contributions. Specifically, several works focus on an overall perspective on multimodal topics, such as representation, fusion strategies, application-oriented methods, and so on. Soleymani et al. [13] discuss current advances in

Comparison of multimodal tasks with related datasets

The size and quality of datasets have a significant effect on PTM performance. On the basis of their position in the PTM pipeline, multimodal tasks can be divided into pre-training and downstream tasks; they can also be grouped according to the modalities involved. Pre-training tasks are typically focused on representational alignment between different modalities and therefore require vast amounts of paired data from multiple modalities, of the kind sketched below.
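The sketch below (a hypothetical layout, not any specific benchmark) shows how such paired pre-training data is commonly organized in code: each sample couples one image with its caption. The annotation file format and field names are assumptions made for illustration.

```python
# A paired image-caption dataset: one aligned (vision, text) sample per line.
import json
from torch.utils.data import Dataset
from PIL import Image

class PairedImageTextDataset(Dataset):
    def __init__(self, annotation_file, transform=None, tokenizer=None):
        # annotation_file: JSON lines of {"image": path, "caption": text}
        with open(annotation_file) as f:
            self.samples = [json.loads(line) for line in f]
        self.transform = transform
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        image = Image.open(item["image"]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        caption = item["caption"]
        if self.tokenizer:
            caption = self.tokenizer(caption, return_tensors="pt")
        return image, caption   # one aligned image-text pair
```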

Most of the document layout multimodal

Document layout based domain

Document AI, also known as Document Intelligence, refers to techniques for automatically reading, comprehending, and analyzing documents [12]. Fig. 2 depicts a variety of document templates, including purchase orders, financial reports, contracts, invoices, receipts, and many others. Documents are structured in a variety of ways, including plain text, multi-column layouts, and a range of figures/forms/tables, making them difficult to interpret [87].
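A minimal sketch of the layout-aware input embedding idea used by document-layout PTMs such as LayoutLM is given below: each text token is combined with the 2-D position of its bounding box on the page. The class, dimensions, and normalization of coordinates to a 0-1000 grid are simplified assumptions rather than the exact implementation of any surveyed model.

```python
# Layout-aware embedding: token embedding + 2-D box-coordinate embeddings.
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1001):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # separate embeddings for normalized x0, y0, x1, y1 box coordinates
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) with coords in [0, 1000]
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.token_emb(token_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))

# Usage: two tokens placed at two boxes on the page (hypothetical values).
ids = torch.tensor([[2023, 2561]])
boxes = torch.tensor([[[50, 100, 180, 130], [200, 100, 320, 130]]])
emb = LayoutAwareEmbedding()(ids, boxes)   # (1, 2, 768) layout-aware embeddings
```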

The multimodal Transformer accepts input from

Vision-text based domain

Vision-linguistic (cross-modal) pre-training is a new research area that has triggered considerable interest in recent years due to its strong ability to transfer knowledge. Vision-related multimodal tasks typically involve text and image modalities (such as Unicoder-VL [48], Image-BERT [50], VL-BERT [51], Pixel-BERT [61], and many others), or text and video modalities (such as CBT [104], UniVL [101], and ActBERT [102]). Given that images and videos are both parts of the vision domain, we

Audio-text based domain

In addition to vision-text, recent works have also focused on the audio-text multimodal community. Two lines of audio-related multimodal pre-training exist. One extends the Transformer to audio-related applications such as speech recognition, speech synthesis, speech enhancement, and music generation [17]; this type of work typically does not make use of audio-text cross-modality fusion and thus falls outside the scope of this survey. Another type of work attempts to
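To illustrate the second, cross-modal line of work, the sketch below projects acoustic frames and text tokens into a shared space and jointly encodes them with a single Transformer; the module names, feature sizes, and fusion-by-concatenation design are illustrative assumptions, not a specific audio-text PTM from this survey.

```python
# Audio-text fusion: log-mel frames and text tokens jointly encoded.
import torch
import torch.nn as nn

class AudioTextFusion(nn.Module):
    def __init__(self, n_mels=80, vocab_size=30522, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, hidden)      # acoustic frames -> hidden
        self.text_emb = nn.Embedding(vocab_size, hidden) # text tokens -> hidden
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mel_frames, token_ids):
        # mel_frames: (batch, audio_len, n_mels); token_ids: (batch, text_len)
        fused = torch.cat([self.audio_proj(mel_frames),
                           self.text_emb(token_ids)], dim=1)
        return self.encoder(fused)   # contextualized audio+text representations

# Usage with a dummy 100-frame clip and a 6-token transcript.
out = AudioTextFusion()(torch.randn(1, 100, 80), torch.randint(0, 30522, (1, 6)))
```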

Future discussions

Although current multimodal PTMs have made significant progress on a variety of tasks in recent years, challenges remain in the following areas, which future research may address.

The first challenge is whether PTMs can be extended across more modalities with enhanced cross-modal representations. According to our survey and others [10], [11], existing work focuses on at most two modalities of cross-modal learning, such as text-vision or text-audio. Recent work [183]

Conclusion

In this survey paper, we present a comprehensive review of recent research on Transformer-based multimodal PTMs. This paper covers various PTMs in different multimodal domains; these models are built on top of vision-text, document layout, and audio-text inputs. We summarize these PTMs and classify their downstream tasks, datasets, input feature embeddings, pre-training tasks, and model architectures to gain a deeper understanding of these efforts. Finally, we conclude by highlighting some

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (212)

  • K. Bayoudh et al., A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets, The Visual Computer (2021)
  • X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, L. Zhang, W. Han, M. Huang, et al., Pre-trained models: Past,...
  • Y. Xu et al., LayoutLM: Pre-training of text and layout for document image understanding
  • W. Guo et al., Deep multimodal representation learning: A survey, IEEE Access (2019)
  • C. Zhang et al., Multimodal intelligence: Representation learning, information fusion, and applications, IEEE J. Selected Top. Signal Process. (2020)
  • T.H. Afridi et al., A multimodal memes classification: A survey and open research issues
  • T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of Transformers, arXiv preprint...
  • S. Khan, M. Naseer, M. Hayat, S.W. Zamir, F.S. Khan, M. Shah, Transformers in vision: A survey, arXiv preprint...
  • A. Gaonkar et al., A comprehensive survey on multimodal data representation and information fusion algorithms
  • L. Ruan, Q. Jin, Survey: Transformer based video-language pre-training, arXiv preprint...
  • K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., A survey on vision...
  • P. Warden, Speech Commands: A dataset for limited-vocabulary speech recognition, arXiv preprint...
  • L. Lugosch, M. Ravanelli, P. Ignoto, V.S. Tomar, Y. Bengio, Speech model pre-training for end-to-end spoken language...
  • K. Maekawa, Corpus of Spontaneous Japanese: Its design and evaluation, ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (2003)
  • H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, T. Kawahara, Distilling the knowledge of BERT for...
  • K. Maekawa et al., Balanced Corpus of Contemporary Written Japanese, Language Resources and Evaluation (2014)
  • D. Sui et al., A large-scale Chinese multimodal NER dataset with speech clues
  • C.-H. Lee et al., ODSQA: Open-domain spoken question answering dataset, in: 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE (2018)
  • V. Raina, M.J. Gales, An initial investigation of non-native spoken question-answering, arXiv preprint...
  • Y.-A. Chung, C. Zhu, M. Zeng, SPLAT: Speech-language joint pre-training for spoken language understanding, arXiv...
  • C.-H. Li, S.-L. Wu, C.-L. Liu, H.-Y. Lee, Spoken SQuAD: A study of mitigating the impact of speech recognition errors...
  • Y.-S. Chuang, C.-L. Liu, H.-Y. Lee, L.-S. Lee, SpeechBERT: An audio-and-text jointly learned language model for...
  • C. You, N. Chen, Y. Zou, Self-supervised contrastive cross-modality representation learning for spoken question...
  • A.B. Zadeh et al., Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph
  • V. Goel, H.-K. Kuo, S. Deligne, C. Wu, Language model estimation for optimizing end-to-end performance of a natural...
  • Y. Huang et al., Leveraging unpaired text data for training end-to-end speech-to-intent systems
  • Y. Jiang, B. Sharma, M. Madhavi, H. Li, Knowledge distillation from BERT transformer to speech transformer for intent...
  • Y. Qian et al., Speech-language pre-training for end-to-end spoken language understanding
  • P. Denisov, N.T. Vu, Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal...
  • M. Radfar, A. Mouchtaris, S. Kunzmann, End-to-end neural transformer based spoken language understanding, arXiv...
  • B. Sharma et al., Leveraging acoustic and linguistic embeddings from pretrained speech and language models for intent classification
  • P. Price, Evaluation of spoken language systems: The ATIS domain, in: Speech and Natural Language: Proceedings of a...
  • S. Calhoun et al., The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue, Language Resources and Evaluation (2010)
  • P. Sharma et al., Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
  • C. Alberti, J. Ling, M. Collins, D. Reitter, Fusion of detected objects in text for visual question answering, arXiv...
  • S. Zhang et al., DeVLBert: Learning deconfounded visio-linguistic representations
  • V. Murahari et al., Large-scale pretraining for visual dialog: A simple state-of-the-art baseline, European Conference on Computer Vision, Springer (2020)
  • G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, Unicoder-VL: A universal encoder for vision and language by cross-modal...
  • Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, UNITER: Universal image-text representation...
  • D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti, ImageBERT: Cross-modal pre-training with large-scale...

    Xue Han, IEEE Senior Member. She received the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences. She is currently a research scientist with the AI Center of the China Mobile Research Institute. Her research interests include NLP and multimodal fusion technology.

    Yitong Wang is currently studying for a master's degree at the Beijing University of Posts and Telecommunications, and contributed to this paper during her internship at the China Mobile Research Institute. Her interests focus on information extraction.

    Junlan Feng, Ph.D., IEEE Fellow, LFN Board Chair, and China AIIA Vice Chair. Dr. Feng is currently the Chief Scientist and General Manager of the AI and Intelligence Operation R&D Center of the China Mobile Research Institute. She has chaired and organized multiple conferences and journals in the fields of data mining, speech, and natural language.

    Chao Deng received the M.S. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 2003 and 2009, respectively. He is currently a deputy general manager with the AI Center of the China Mobile Research Institute. His research interests include artificial intelligence for ICT operations.

    Zhan-Heng Chen received his Ph.D. degree from the University of Chinese Academy of Sciences, China. His current research interests include data mining, natural language processing, bioinformatics, machine learning, and pattern identification. He has authored over 40 research publications in these areas (published in Briefings in Bioinformatics, Molecular Therapy-Nucleic Acids, iScience, Communications Biology, BMC Genomics, BMC Systems Biology, Frontiers in Genetics, International Journal of Molecular Sciences, Journal of Cellular and Molecular Medicine, and others) and at international conferences (such as ICIBM and ICIC).

    Yu-An Huang received a Ph.D. degree from the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, in 2020. Currently, he is an associate professor at the School of Computer Science, Northwestern Polytechnical University. His current research interests mainly focus on data mining algorithms and applications.

    Su Hui, Senior Algorithm Researcher at Tencent, working in the WeChat AI Pattern Recognition Center. His main research interests are dialogue systems, text summarization, and general natural language processing models. He is also responsible for the product implementation of dialogue systems and WeChat security. He has published more than 20 papers at ACL, EMNLP, AAAI, and other well-known international conferences, with more than 1200 citations. He received his Master of Engineering degree from the Institute of Software, Chinese Academy of Sciences, in 2018.

    Lun Hu received the B.Eng. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2006, and the M.Sc. and Ph.D. degrees from the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, in 2008 and 2015, respectively. He joined the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China, in 2020 as a Professor of computer science. His research interests include machine learning, complex network analytics, and their applications in bioinformatics.

    Pengwei Hu, Leading Scientist in the Science & Technology Office at Merck. He received his Ph.D. from the Department of Computing, The Hong Kong Polytechnic University, in 2018. Dr. Hu's main research interest is in machine learning, including AI healthcare and biomedical informatics. Specifically, he is interested in what kind of person will have what kind of disease and the interplay between disease and gene expression. He has authored more than 60 articles in the above areas. He is currently an Associate Editor of BMC Medical Informatics and Decision Making, and also serves as an editor of Computational and Mathematical Methods, Frontiers in Neurorobotics, and Frontiers in Medicine.
