On behalf of the organizing committee, it is our distinct pleasure to extend a warm welcome to the LGM3A Workshop. As Chairs of this workshop, we are delighted to bring together a community of scholars, researchers, and professionals from diverse backgrounds, all driven by a shared passion for advancing the frontiers of knowledge in our field.
In recent years, the field of large language models has witnessed remarkable growth, with models like GPT, T5, RoBERTa, and BERT transforming our understanding of natural language processing. These models, trained on vast volumes of text data, have empowered us to decode the intricate structures and patterns of human language. Moreover, with the surge in multimodal data (comprising audio, visual, and text), we are now on the cusp of an exciting era where large generative language models are poised to revolutionize multimodal applications.
The workshop serves as a pivotal platform to delve into this dynamic intersection of language models and multimodal applications. In the era of BLIP, Flamingo, KOSMOS, PaLM-E, LLaVA, Visual ChatGPT, and the eagerly awaited GPT-4, we find ourselves at a juncture where large language models are enabling us to understand and generate responses with unprecedented accuracy and nuance across diverse modalities.
Proceeding Downloads
Large Generative Models Meet Multimodal Video Intelligence
In this talk, I would like to share my recent research around multimodal video intelligence in the era of large generative models. I will first talk about video-language pretraining techniques (All-in-one, EgoVLP) that use one single model to power ...
Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models
Benefitting from unprecedented computational power, massive data throughput, and superhuman memory, large language models (LLMs) are fundamentally transforming multimodal machine learning. An LLM can be analogized to an enormous treasure box guarded by ...
Multi-Modal Generative AI with Foundation Models
Generating photorealistic and controllable visual contents has been a long-pursuing goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work in ...
NeurSEG: A Segment Driven Deep Neural Model for Nested Named Entity Recognition
Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). Apart from flat entities, nested entities also commonly exist in real-life textual data. However, the current methods are not capable of handling nested ...
SAT: Self-Attention Control for Diffusion Models Training
Recent text-to-image diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, a persistent challenge lies in the generation of detailed images, especially human-related images, which often ...
Multimodal Data Augmentation for Image Captioning using Diffusion Models
Image captioning, an important vision-language task, often requires a tremendous number of finely labeled image-caption pairs for learning the underlying alignment between images and texts. In this paper, we propose a multimodal data augmentation ...
ImEW: A Framework for Editing Image in the Wild
The ability to edit images in a realistic and visually appealing manner is a fundamental requirement in various computer vision applications. In this paper, we present ImEW, a unified framework designed for solving image editing tasks. ImEW utilizes off-...
CGSMP: Controllable Generative Summarization via Multimodal Prompt
- Qian Yong,
- Jueqi Wei,
- YiRen Zhang,
- XiLun Zhang,
- Chao Wei,
- Simiao Chen,
- Yunhe Li,
- Cheng Ye,
- Bing Huang,
- Hao Wang
Natural Language Generation (NLG) has advanced rapidly in recent years thanks to the development of large language models (LLMs). This advancement has resulted in more fluent and coherent Natural Language Generation, which has contributed to ...
Generating Multimodal Augmentations with LLMs from Song Metadata for Music Information Retrieval
In this work we propose a set of new automatic text augmentations that apply Large Language Models to song metadata to improve music information retrieval tasks. Compared to recent works, our proposed methods leverage large language models and ...
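The core idea of metadata-driven augmentation can be illustrated with a small sketch that turns song metadata into a text prompt for an LLM; the field names ("title", "artist", "genres") and the prompt template are illustrative assumptions, not the paper's actual schema:

```python
def build_augmentation_prompt(meta: dict) -> str:
    # Field names and template are hypothetical, for illustration only.
    lines = [f"Title: {meta.get('title', 'unknown')}",
             f"Artist: {meta.get('artist', 'unknown')}"]
    if meta.get("genres"):
        lines.append("Genres: " + ", ".join(meta["genres"]))
    # The resulting prompt would be sent to an LLM to generate an
    # augmented textual description for the retrieval index.
    return ("Write a short, factual description of this song for a "
            "music-retrieval index.\n" + "\n".join(lines))

prompt = build_augmentation_prompt(
    {"title": "So What", "artist": "Miles Davis", "genres": ["jazz", "modal"]})
```

The generated descriptions would then serve as extra text views of each track when training or querying a retrieval model.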
Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
In this paper, we introduce Subsampling of frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training Vision-Language Models (VLMs). SW-CLIP uses frequency-based subsampling of words that has been previously ...
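Frequency-based subsampling of this kind is commonly implemented in the spirit of the word2vec subsampling rule, where a word is kept with a probability that shrinks as its corpus frequency grows. A minimal sketch under that assumption (the threshold `t` and toy frequencies are illustrative, not taken from the paper):

```python
import math
import random

def keep_probability(freq: float, t: float = 1e-5) -> float:
    # word2vec-style rule: keep a word with probability sqrt(t / freq),
    # capped at 1, so rare words are always kept and frequent words
    # are mostly dropped.
    return min(1.0, math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, seed=0):
    # Stochastically drop frequent tokens from a caption before training.
    rng = random.Random(seed)
    return [w for w in tokens if rng.random() < keep_probability(freqs[w], t)]

# Toy relative corpus frequencies: "the" is very common, "cat"/"sat" are rare.
freqs = {"the": 0.05, "cat": 1e-6, "sat": 2e-6}
tokens = ["the", "cat", "sat"] * 100
kept = subsample(tokens, freqs)
```

With these toy numbers, rare content words like "cat" survive unchanged while the highly frequent "the" is almost always removed, which is the intended effect when pre-training on caption text.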
Fashion-GPT: Integrating LLMs with Fashion Retrieval System
Although customers on a fashion e-commerce platform express their clothing preferences through combined imagery and textual information, they are limited to retrieval with single-round, fixed inputs. At the same time, large language models (LLMs) have ...
Index Terms
- Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications