Feature-guided Multimodal Sentiment Analysis towards Industry 4.0

https://doi.org/10.1016/j.compeleceng.2022.107961

Highlights

  • Advanced and efficient image-text multimodal fusion approach.

  • Matrix transformation is used to align features from different modalities.

  • An attention mechanism keeps the model parallelizable and improves training speed.

  • Unimodal and multimodal features are concatenated to preserve the complementarity of the modalities.

  • The difficulty of obtaining domain datasets is addressed by constructing a generic dataset.

Abstract

Applying Artificial Intelligence (AI) to process rich media information has become an important part of Industry 4.0. Sentiment recognition in AI aims to analyze the user emotions contained in rich media to facilitate service enhancement. Previous research on sentiment recognition has mainly focused on academia, and few studies have discussed algorithmic applications and innovations in industry. In this paper, we propose a general approach to multimodal sentiment recognition for images and text. The method offers a new way of processing rich media information by fully considering the internal features of each modality as well as the correlations between modalities. On the dataset constructed in this paper, accuracy improves by more than 4% over unimodal methods. The effectiveness and generality of the method in multimodal sentiment recognition are demonstrated by extending the experiments to a public multimodal dataset.

Introduction

As an important part of Artificial Intelligence (AI), emotion recognition in industry aims to tap into users' potential opinions and attitudes toward industrial products and services, help discover users' needs and preferences, and improve related services, thus accelerating the development of Industry 4.0. Current research on sentiment recognition is mainly concentrated in academia and focuses mostly on unimodal data. Compared with unimodal data, multimodal data hides richer information. Rich media data in industry contains multiple modalities, and our goal is to exploit these modalities effectively to recognize users' emotional tendencies.

Current research on multimodal sentiment recognition is limited by incomplete data, sparse data, and ambiguous labels on the data side, and by modality alignment and weak feature extraction on the model side. Moreover, the generality of the models is an important challenge. Although the development of Industry 4.0 [1] has produced a large amount of data, much of it is not reasonably organized or applied, and security concerns around opening industrial data prevent such datasets from being released publicly. However, training artificial intelligence models depends on large amounts of domain data.

In recent recognition work, Huang et al. [2] proposed a cross-modal dynamic convolution method to address time-domain sparsity and alignment problems; Yu et al. [3] proposed a joint learning method to ensure learning consistency and discrepancy; Zhang et al. [4] proposed using label dependency to resolve label ambiguity in multi-label multimodal sentiment recognition. However, these works remain in the academic domain and target specific datasets.

Inspired by previous work, this paper proposes TIBERT, a BERT-based, text feature-guided, generic multimodal sentiment recognition method for text and images. TIBERT extracts implicit features from text with BERT and from images with ResNet, aligns feature dimensions through a feature matrix transformation, and achieves feature interaction and fusion through a feature-guided approach. The effectiveness and scalability of the model are demonstrated by constructing a generic dataset that simulates industrial data and by conducting extension experiments on publicly available datasets.
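To make this pipeline concrete, the minimal PyTorch sketch below shows how the two unimodal encoders and the matrix-transformation alignment could be wired together. The specific backbones (bert-base-uncased, ResNet-50), the 768-dimensional target space, and the linear projection used for alignment are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel


class UnimodalEncoders(nn.Module):
    """Extract text and image features and align their dimensions."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # Pre-trained BERT extracts the implicit features of the text
        # (one contextual vector per token).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Pre-trained ResNet extracts the implicit features of the image;
        # the classification head is dropped to keep the spatial feature map.
        backbone = resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # The "matrix transformation" is sketched here as a learned linear map
        # projecting the 2048-d image regions into the same space as the text
        # features, so the two modalities are dimension-aligned before fusion.
        self.img_proj = nn.Linear(2048, d_model)

    def forward(self, input_ids, attention_mask, images):
        # Text features: (batch, seq_len, d_model)
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Image features: (batch, 2048, 7, 7) -> (batch, 49, 2048) -> (batch, 49, d_model)
        feat_map = self.cnn(images)
        img_feat = feat_map.flatten(2).transpose(1, 2)
        img_feat = self.img_proj(img_feat)
        return text_feat, img_feat
```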

The main contributions of this paper are as follows.

  • In this paper, a general multimodal sentiment recognition algorithm model, TIBERT, is constructed to better tap into the potential opinion attitudes of users and assist in discovering their preferences for the development of Industry 4.0.

  • Feature interaction based on matrix transformation and feature guidance is proposed to speed up information extraction from the feature matrix and shorten training time, while accuracy is improved by fusing multimodal and unimodal features (a minimal sketch of this fusion follows below).

  • Since there is no well-established, annotated multimodal sentiment dataset in industry, this paper constructs a Weibo multimodal dataset that simulates, as closely as possible, users' comments about industrial goods and services.

The model was tested on a self-built generic dataset and other publicly available datasets to demonstrate its effectiveness and generality.
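As a rough illustration of the feature-guided fusion and the unimodal/multimodal stitching described in the contributions, the sketch below uses text features as attention queries over the aligned image features (keeping the computation parallel across positions) and then concatenates the pooled unimodal and fused representations before classification. The module layout, head count, and mean pooling are our assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn


class FeatureGuidedFusion(nn.Module):
    """Text-guided attention over image features, followed by feature stitching."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, n_classes: int = 3):
        super().__init__()
        # Multi-head attention processes all positions in parallel; the text
        # features act as queries that "guide" attention over image features.
        self.guide_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The classifier sees unimodal and multimodal features side by side,
        # preserving the complementarity of the modalities.
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, text_feat, img_feat):
        # text_feat: (batch, seq_len, d_model); img_feat: (batch, regions, d_model)
        fused, _ = self.guide_attn(query=text_feat, key=img_feat, value=img_feat)
        # Mean-pool each stream into a single vector per sample.
        text_vec = text_feat.mean(dim=1)
        img_vec = img_feat.mean(dim=1)
        fused_vec = fused.mean(dim=1)
        # Stitch unimodal and multimodal features together and classify.
        joint = torch.cat([text_vec, img_vec, fused_vec], dim=-1)
        return self.classifier(joint)
```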

The rest of this paper is structured as follows. Section 2 outlines related work and explains how this paper differs from previous approaches. Section 3 describes the details and overall flow of the proposed algorithm. Section 4 presents the experiments, comparative results, and ablation studies. Section 5 concludes the paper and gives an outlook.

Section snippets

Models for text classification

The rich media to be processed in the Industry 4.0 service layer contains a large amount of textual information. In text processing methods implemented with Machine Learning (ML), attention mechanisms can give higher weight to key information in the text and are therefore widely used. The emergence of the Transformer has pushed the attention mechanism to the forefront, and currently popular pre-training models are mainly Transformer-based. For the text classification task, Peters et al. [5]

Methodology

The approach proposed in this paper can be formulated as follows: given a multimodal corpus, this paper aims to learn a multimodal sentiment classifier that can correctly predict users' emotions in a new sample. The ultimate goal is to achieve sentiment recognition of rich media information in the service layer under Industry 4.0.
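Stated compactly, this formulation can be written as follows; the notation is introduced here only for illustration and does not appear in the paper.

```latex
% Multimodal corpus of N samples: text t_i, image v_i, sentiment label y_i.
% The goal is a classifier f_theta whose parameters minimize cross-entropy.
\[
  \mathcal{D} = \{(t_i, v_i, y_i)\}_{i=1}^{N}, \qquad
  \hat{\theta} = \arg\min_{\theta} \; -\frac{1}{N}\sum_{i=1}^{N}
  \log p_{\theta}\!\left(y_i \mid t_i, v_i\right)
\]
```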

Fig. 1 illustrates the structure of the multimodal emotion recognition model in this paper. The model is divided into three parts: the BERT Text Feature

Experiment

This paper proposes a sentiment classification model to deal with rich media information about industrial products and services generated during Industry 4.0. However, since there is no well-established, annotated multimodal sentiment dataset in industry, this paper constructs a Weibo-based multimodal dataset, following the idea of a universal dataset, to meet the experimental requirements. In addition, to test the generality and extensibility of the model, experiments

Conclusions

It is our goal to effectively identify users' emotional tendencies, fully explore their potential opinion attitudes towards industrial products and services, and help discover their needs and preferences, thus accelerating the development of Industry 4.0. To achieve this goal, the TIBERT model fully learns intra- and inter-modal feature information and fuses unimodal features and multimodal features, thus improving recognition accuracy. In order to be able to test the generality and scalability

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Young Talent Program of the Organization Department of Liaoning Provincial Committee (Project No. XLYC1907001).


References (25)

  • H. Wen et al., Cross-modal dynamic convolution for multimodal emotion recognition, Journal of Visual Communication and Image Representation, 2021.

  • D.D. Chakladar et al., A multimodal-Siamese Neural Network (mSNN) for person verification using signatures and EEG, Information Fusion, 2021.

  • U.K. Lilhore et al., Impact of Deep Learning and Machine Learning in Industry 4.0: Impact of Deep Learning, Cyber-Physical, IoT, and Autonomous Systems in Industry 4.0, 2021.

  • W. Yu et al., Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis, arXiv preprint, 2021.

  • D. Zhang et al., Multimodal Multi-label Emotion Recognition with Heterogeneous Hierarchical Message Passing, Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

  • M.E. Peters et al., Deep contextualized word representations, arXiv preprint, 2018.

  • A. Radford et al., Improving language understanding by generative pre-training, 2018.

  • J. Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint, 2018.

  • A.A. Khan et al., Machine Learning in Computer Vision: A Review, 2021.

  • S. Zerdoumi et al., Image pattern recognition in big data: taxonomy and open challenges: survey, Multimedia Tools and Applications, 2018.

  • K. He et al., Rethinking ImageNet pre-training.

  • S. Tammina, Transfer learning using VGG-16 with deep convolutional neural network for classifying images, International Journal of Scientific and Research Publications (IJSRP), 2019.

    Bihui Yu is a master's student supervisor at Shenyang Institute of Computing Technology, Chinese Academy of Sciences. His research interests are industrial big data, knowledge engineering.

    Jingxuan Wei is a master's student at the University of Chinese Academy of Sciences, trained at the Shenyang Institute of Computing Technology, Chinese Academy of Sciences. His interests are natural language processing and multimodal sentiment analysis.

    Bo Yu is a Ph.D. supervisor at the Shenyang Institute of Computing Technology, Chinese Academy of Sciences, and a senior member of the Chinese Computer Society. His research interests are multimedia data flows in communication networks.

    Xingye Cai is a student at the University of Chinese Academy of Sciences. His research interests are knowledge engineering and natural language processing.

    Ke Wang is a student at the University of Chinese Academy of Sciences. His research interests are named entity recognition and information extraction.

    Huajun Sun is a master's student supervisor at Shenyang Institute of Computing Technology, Chinese Academy of Sciences. His research interests include processing multimedia data streams for real-time communication over networks.

    Liping Bu is a master's student supervisor at Shenyang Institute of Computing Technology, Chinese Academy of Sciences. Her research interests are industrial big data, knowledge engineering.

    Xiaowei Chen is a master's student at the Shenyang Institute of Computing Technology, University of Chinese Academy of Sciences. His research interests include big data storage and multimodal classification.
