Impact Statement:
Target-oriented multimodal sentiment classification is a new subtask of multimodal sentiment analysis that identifies the sentiment tendency of a given target word in a sentence based on multimodal information. Existing work focuses on fine-grained image features using complex transformer-based methods, but these require substantial computational power. By comparison, we make the first attempt to design a novel lightweight MLP-based model that aims to achieve precise target-level sentiment classification with fewer computational resources. From a methodological viewpoint, the method adopts rearrangement and restore operations that effectively mix multimodal feature information. Experimental results on two benchmark datasets validate our conclusions, providing innovative solution perspectives for the TMSC task and motivating follow-up research.
Abstract:
With the development of fine-grained multimodal sentiment analysis tasks, target-oriented multimodal sentiment classification (TMSC) has received increasing attention; it aims to classify the sentiment of a target with the help of textual and associated image features. Existing methods focus on exploring fine-grained image features and incorporate complex transformer-based fusion strategies, while ignoring the heavy computational burden. Recently, some lightweight multilayer perceptron (MLP)-based methods have been successfully applied to multimodal sentiment classification tasks. In this article, we propose an effective rearrangement and restore mixer model (RR-Mixer) for TMSC, which models the interaction of image, text, and targets along the modal axis, sequential axis, and feature-channel axis through rearrangement and restore operations. Specifically, we take the vision transformer (ViT) and robustly optimized BERT (RoBERTa) pretrained models to extract image and textual features, respectively.
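The abstract describes mixing fused features along three axes (modal, sequential, feature-channel) via rearrangement and restore operations. The sketch below is not the authors' RR-Mixer; it is a minimal illustration of the general idea, assuming a hypothetical fused tensor of shape (modalities, sequence length, channels), randomly initialized MLP weights, and ReLU instead of whatever activation the paper uses. "Rearrange" is modeled as moving the axis to be mixed into the last position, and "restore" as moving it back after the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, d_hidden):
    # Two-layer MLP applied along the last axis. Weights are random
    # placeholders; in a real model they would be learned parameters.
    d = x.shape[-1]
    w1 = rng.standard_normal((d, d_hidden)) * 0.02
    w2 = rng.standard_normal((d_hidden, d)) * 0.02
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU here purely for illustration

def axis_mix(x, axis, d_hidden=64):
    # Rearrange: move the target axis last, mix it with an MLP,
    # then restore the original axis order (with a residual connection).
    x = np.moveaxis(x, axis, -1)
    x = x + mlp(x, d_hidden)
    return np.moveaxis(x, -1, axis)

# Hypothetical fused features: (modalities, sequence length, channels),
# e.g. stacked text and image token streams after projection.
tokens = rng.standard_normal((2, 16, 32))

mixed = tokens
for axis in (0, 1, 2):  # modal axis, sequential axis, feature-channel axis
    mixed = axis_mix(mixed, axis)

print(mixed.shape)
```

Because each mix step restores the original layout, the three operations compose freely and the output keeps the input's shape, which is what makes this style of mixer cheap compared with attention-based fusion.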
Published in: IEEE Transactions on Artificial Intelligence ( Volume: 5, Issue: 6, June 2024)