A survey of transformer-based multimodal pre-trained models
Introduction
The way in which something occurs or is experienced is referred to as its modality, which includes audio, video, image, and text [1]. Multimodal data describing the same or related objects have grown exponentially in recent years, and multimodality has consequently emerged as the primary form of information resource [2]. Human cognitive processes are multimodal in nature: the things we see, hear, and say are combined for processing and understanding, so multimodal machine learning approaches resemble how humans perceive their surroundings. Models must therefore be created to comprehend and manage such multimodal data. Because of the heterogeneity of the data, multimodal machine learning presents some unique challenges. For example, representation learning and feature extraction methods have been relatively well developed in the traditional unimodal domain [3], [4], [5], [6], whereas in the multimodal domain there is a desire to employ few-shot or zero-shot learning due to the scarcity of high-quality labeled multimodal data.
The Transformer [7], which was initially proposed as a sequence-to-sequence model for machine translation, has demonstrated significant performance improvements and gained prominence in deep learning. BERT (Bidirectional Encoder Representations from Transformers) [8] can be regarded as a key milestone in natural language processing (NLP): it utilizes a Transformer network for pre-training on an unlabeled text corpus and achieves state-of-the-art (SOTA) performance on 11 downstream tasks. Later research demonstrates that pre-trained models (PTMs) based on the Transformer can achieve breakthrough performance on a variety of NLP tasks [9]. As a result, the Transformer has established itself as the dominant architecture for NLP, particularly in the context of PTMs. The standard pre-training and fine-tuning methodology is to train the model on a large amount of training data and then fine-tune it on smaller task-specific datasets for downstream tasks. The pre-training step allows the model to gain a generic representation, which is helpful for downstream tasks.
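To make the recipe concrete, the following is a minimal PyTorch sketch of the pre-train/fine-tune workflow described above; the toy encoder, dimensions, and classification task are illustrative stand-ins, not any specific model from the literature.

```python
# Minimal sketch of the pre-train / fine-tune recipe (illustrative only).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """A toy Transformer encoder standing in for a large pre-trained model."""
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))  # (batch, seq, d_model)

# Stage 1: pre-training on a large unlabeled corpus would happen here,
# e.g. with a masked-token objective; we assume weights are already learned.
backbone = TinyEncoder()

# Stage 2: fine-tuning. Attach a small task head and train on a smaller
# labeled dataset for the downstream task (binary classification here).
head = nn.Linear(64, 2)
params = list(backbone.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

tokens = torch.randint(0, 1000, (8, 16))     # stand-in labeled batch
labels = torch.randint(0, 2, (8,))
logits = head(backbone(tokens).mean(dim=1))  # pool over the sequence
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```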
The breakthrough of Transformer-based PTMs in NLP has inspired academic interest in the convergence of several modalities, such as video and text or image and text [10]. Multimodal PTMs based on the Transformer structure can learn semantic correspondence between different modalities by pre-training on large amounts of unlabeled data and then fine-tuning on small amounts of labeled data [11]. Depending on the modalities employed, the majority of these cross-modal works can be further classified as image-text-based, video-text-based, or audio-text-based tasks. In image-text-based tasks, for instance, the model is intended to associate the word "cat" in the caption text with the cat in the image. To accomplish this goal, a carefully designed PTM is required, allowing the model to investigate the relationship between different modalities.
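One common way such cross-modal correspondence is learned is through a contrastive objective over paired image and text embeddings. The hedged sketch below assumes pre-computed feature vectors and is a generic illustration of the idea, not the implementation of any particular PTM surveyed here.

```python
# Hedged sketch: contrastive alignment of paired image/text embeddings,
# one common way to learn cross-modal correspondence ("cat" <-> cat image).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (batch, dim) features of *paired* examples."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))   # i-th image pairs with i-th text
    # Symmetric cross-entropy: match images to texts and texts to images.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
```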
In this survey, we aim to provide a comprehensive review of existing Transformer-based multimodal PTMs. As shown in Fig. 1, we categorize existing multimodal PTMs into document layout, vision-text-based, and audio-text-based domains according to differences in applications and downstream tasks, input feature embeddings, pre-training tasks, and model architectures. It is worth mentioning that document layout, also known as Document AI or Document Intelligence, is a relatively recent research field that integrates layout spatial information, image information, and text information [12]. To the best of our knowledge, this survey is the first to introduce the related PTM research progress in this multimodal domain.
The rest of this paper is organized as follows: Section 2 begins with a comparison with previous surveys and highlights the unique contribution of our work. Section 3 gives a comprehensive comparison of prevalent multimodal tasks with their corresponding datasets. Sections 4 (document layout based domain), 5 (vision-text based domain), and 6 (audio-text based domain) summarize the primary existing methods in the respective multimodal domains in terms of downstream applications, input feature embeddings, pre-training tasks, and model architectures. Section 7 discusses the challenges and future work, while Section 8 concludes this paper.
Comparison with previous surveys
In the past two years, a number of survey papers on multimodal learning or pre-trained models (as listed in Table 1) have provided good overviews of the progress of this sub-field. In this part, we compare our work to earlier surveys to emphasize its distinct contribution. Specifically, several works focus on the overall perspective of multimodal topics, such as representation, fusion strategies, application-oriented methods, and so on. Soleymani et al. [13] discuss current advances in
Comparison of multimodal tasks with related datasets
The size and quality of datasets have a significant effect on PTM performance. On the basis of their position in the PTM pipeline, multimodal tasks can be divided into pre-training tasks and downstream tasks. They can also be grouped according to the modalities involved. Pre-training tasks are typically focused on representational alignment between different modalities, necessitating vast amounts of paired multimodal data.
Most of the document layout multimodal
Document layout based domain
Document AI, also known as Document Intelligence, refers to techniques for automatically reading, comprehending, and analyzing documents [12]. Fig. 2 depicts a variety of document templates, including purchase orders, financial reports, contracts, invoices, receipts, and many others. Documents are structured in a variety of ways, including plain text, multi-column layouts, and a range of figures/forms/tables, making them difficult to interpret [87].
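As a concrete illustration of how layout information can enter a Transformer, the sketch below follows the general idea of LayoutLM [12]: OCR bounding-box coordinates are embedded and summed with word embeddings. The vocabulary size, dimensions, and shared coordinate tables are illustrative assumptions; individual models differ in their exact embedding schemes.

```python
# Sketch of a LayoutLM-style input embedding: each token carries its text id
# plus the (x0, y0, x1, y1) bounding box from OCR; each coordinate is embedded
# separately and summed with the word embedding. Dimensions are illustrative.
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, d_model=128, max_coord=1000):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_model)
        self.x = nn.Embedding(max_coord + 1, d_model)  # shared for x0 and x1
        self.y = nn.Embedding(max_coord + 1, d_model)  # shared for y0 and y1

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 4) integer coords normalized to [0, max_coord]
        x0, y0, x1, y1 = boxes.unbind(dim=-1)
        return (self.word(token_ids)
                + self.x(x0) + self.y(y0) + self.x(x1) + self.y(y1))

emb = LayoutEmbedding()
tokens = torch.randint(0, 30522, (2, 8))
boxes = torch.randint(0, 1001, (2, 8, 4))
out = emb(tokens, boxes)  # (2, 8, 128), fed to a Transformer encoder
```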
The multimodal Transformer accepts input from
Vision-text based domain
Vision-linguistic (cross-modal) pre-training is a new research area that has triggered considerable interest in recent years due to its strong ability to transfer knowledge. Vision-related multimodal tasks typically involve text and image modalities (such as Unicoder-VL [48], Image-BERT [50], VL-BERT [51], Pixel-BERT [61], and many others), or text and video modalities (such as CBT [104], UniVL [101], and ActBERT [102]). Given that images and videos are both parts of the vision domain, we
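Many of the image-text models listed above adopt a single-stream design in which image region features and text tokens are processed by one shared Transformer. The following is a simplified, hedged sketch of that pattern; the detector feature size, layer counts, and projection are assumptions for illustration, not the architecture of any single cited model.

```python
# Hedged sketch of single-stream vision-text fusion: project image region
# features into the token embedding space, concatenate with text embeddings,
# and run one shared Transformer encoder over the joint sequence.
import torch
import torch.nn as nn

d_model = 128
text_embed = nn.Embedding(30522, d_model)
region_proj = nn.Linear(2048, d_model)  # detector region features -> d_model
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

token_ids = torch.randint(0, 30522, (2, 12))  # caption tokens
regions = torch.randn(2, 36, 2048)            # 36 detected regions per image

joint = torch.cat([text_embed(token_ids), region_proj(regions)], dim=1)
fused = encoder(joint)  # (2, 12 + 36, d_model): every token attends to regions
```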
Audio-text based domain
In addition to vision-text, recent works have also focused on the audio-text multimodal community. There are two lines of audio-related multimodal pre-training. One type intends to extend the Transformer to audio-related applications such as speech recognition, speech synthesis, speech enhancement, and music generation [17]. This type of work typically does not make use of audio-text cross-modality fusion and thus falls outside the scope of this survey. Another type of work attempts to
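For the second line of work, a typical ingredient is a cross-attention layer that lets text token states attend to frame-level audio features. The sketch below is a generic, hedged illustration of such audio-text fusion; the shapes and the front-end assumed to produce the audio frames are illustrative.

```python
# Hedged sketch of audio-text cross-modal fusion: text token states query
# frame-level audio features through a cross-attention layer.
import torch
import torch.nn as nn

d_model = 128
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

text_states = torch.randn(2, 16, d_model)    # 16 text token states
audio_frames = torch.randn(2, 200, d_model)  # 200 audio frames (e.g. from a CNN front-end)

# Each text token attends over all audio frames; the output mixes in
# acoustic evidence aligned to that token.
fused, attn_weights = cross_attn(query=text_states,
                                 key=audio_frames,
                                 value=audio_frames)
```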
Future discussions
Although current multimodal PTMs have made significant progress on a variety of tasks in recent years, challenges remain in the following areas, which future research may address.
The first challenge is whether PTMs can be extended across more modalities with enhanced cross-modal representations. According to our survey and others [10], [11], existing work focuses on at most two modalities of cross-modal learning, such as text-vision or text-audio. Recent work [183]
Conclusion
In this survey paper, we present a comprehensive review of recent research on Transformer-based multimodal PTMs. This paper covers various PTMs in different multimodal domains. These models are built on top of vision-text, document layout, and audio-text inputs. We summarize these PTMs and classify their downstream tasks, datasets, input feature embeddings, pre-training tasks, and model architectures to gain a deeper understanding of these efforts. Finally, we conclude by highlighting some
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (212)
- et al., Shape recognition based on neural networks trained by differential evolution algorithm, Neurocomputing (2007).
- et al., Feature extraction using constrained maximum variance mapping, Pattern Recogn. (2008).
- et al., A novel full structure optimization algorithm for radial basis probabilistic neural networks, Neurocomputing (2006).
- et al., A survey of multimodal sentiment analysis, Image Vis. Comput. (2017).
- et al., Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
- Radial basis probabilistic neural networks: Model and application, Int. J. Pattern Recognit. Artif. Intell. (1999).
- et al., A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks, IEEE Trans. Neural Networks (2008).
- et al., Attention is all you need, Proceedings of NeurIPS (2017).
- J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language...
- et al., Pre-trained models for natural language processing: A survey, Sci. China Technol. Sci. (2020).
- A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets, The Visual Computer, Springer, June (10).
- LayoutLM: Pre-training of text and layout for document image understanding.
- Deep multimodal representation learning: A survey, IEEE Access.
- Multimodal intelligence: Representation learning, information fusion, and applications, IEEE J. Selected Top. Signal Process.
- A multimodal memes classification: A survey and open research issues.
- A comprehensive survey on multimodal data representation and information fusion algorithms.
- Corpus of spontaneous Japanese: Its design and evaluation, ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition.
- Balanced corpus of contemporary written Japanese, Language Resources and Evaluation.
- A large-scale Chinese multimodal NER dataset with speech clues.
- ODSQA: Open-domain spoken question answering dataset, 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE.
- Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph.
- Leveraging unpaired text data for training end-to-end speech-to-intent systems.
- Speech-language pre-training for end-to-end spoken language understanding.
- Leveraging acoustic and linguistic embeddings from pretrained speech and language models for intent classification.
- The NXT-format Switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue, Language Resources and Evaluation.
- Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.
- DeVLBERT: Learning deconfounded visio-linguistic representations.
- Large-scale pretraining for visual dialog: A simple state-of-the-art baseline, European Conference on Computer Vision, Springer.
Xue Han, IEEE Senior member. She received the Ph.D. degree from Institute of Computing Technology, Chinese Academy of Sciences. She is currently a research scientist with AI Center of China Mobile Research Institute. Her research interests include NLP and multimodal fusion technology.
Yitong Wang is currently studying for a master's degree at Beijing University of Posts and Telecommunications, and contributed to this paper during her internship at China Mobile Research Institute. Her interests focus on information extraction.
Junlan Feng, Ph.D., IEEE Fellow, LFN Board Chair, and China AIIA Vice Chair. Dr. Feng is the Chief Scientist and General Manager of the AI and Intelligence Operation R&D Center of China Mobile Research Institute. She has chaired and organized multiple conferences and journals in the fields of data mining, speech, and natural language.
Chao Deng received the M.S. degree and the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2003 and 2009 respectively. He is currently a deputy general manager with AI Center of China Mobile Research Institute. His research interests include artificial intelligence for ICT operations.
Zhan-Heng Chen received his Ph.D. degree from the University of Chinese Academy of Sciences, China. His current research interests include data mining, natural language processing, bioinformatics, machine learning, and pattern identification. He has authored over 40 research publications in these areas (published in Briefings in Bioinformatics, Molecular Therapy-Nucleic Acids, iScience, Communications Biology, BMC Genomics, BMC Systems Biology, Frontiers in Genetics, International Journal of Molecular Sciences, Journal of Cellular and Molecular Medicine, and others) and at international conferences (such as ICIBM and ICIC).
Yu-An Huang received a PhD degree from the department of Computing, the Hong Kong Polytechnic University, Hong Kong, in 2020. Currently, he is an associate professor at the School of Computer Science, Northwestern Polytechnical University. His current research interests mainly focus on data mining algorithms and applications.
Su Hui is a Senior Algorithm Researcher at Tencent, working in the WeChat AI Pattern Recognition Center. His main research interests are dialogue systems, text summarization, and general natural language processing models. He is also responsible for the product implementation of dialogue systems and WeChat security. He has published more than 20 papers at ACL, EMNLP, AAAI, and other renowned international conferences, with more than 1200 citations. He received his Master of Engineering degree from the Institute of Software, Chinese Academy of Sciences in 2018.
Lun Hu received the B.Eng. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2006, and the M.Sc. and Ph.D. degrees from the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, in 2008 and 2015, respectively. He joined the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China in 2020 as a Professor of computer science. His research interests include Machine Learning, Complex Network Analytics and their applications in Bioinformatics.
Pengwei Hu is a Leading Scientist in the Science & Technology Office at Merck. He received his Ph.D. from the Department of Computing, The Hong Kong Polytechnic University in 2018. Dr. Hu's main research interest is in machine learning, including AI healthcare and biomedical informatics. Specifically, he is interested in what kind of person will have what kind of disease and the interplay between disease and gene expression. He has authored more than 60 articles in the above areas. He is currently an Associate Editor of BMC Medical Informatics and Decision Making. He also serves as an editor of Computational and Mathematical Methods, Frontiers in Neurorobotics, and Frontiers in Medicine.