Affective Video Content Analysis With Adaptive Fusion Recurrent Network


Abstract:

Affective video content analysis is an important research topic in video content analysis and has extensive applications. Intuitively, multimodal features can depict elicited emotions, and the accumulation of temporal inputs influences the viewer's emotion. Although a number of methods have been proposed for this task, the adaptive weighting of modalities and the correlation of temporal inputs are still not well studied. To address these issues, a novel framework is designed to learn the weights of modalities and temporal inputs from video data. Specifically, three network layers are designed: a statistical-data layer to improve the robustness of the data, a temporal-adaptive-fusion layer to fuse temporal inputs, and a multimodal-adaptive-fusion layer to combine multiple modalities. In particular, the feature vectors of the three input modalities are extracted from three pre-trained convolutional neural networks, respectively, and then fed to three statistical-data layers. The output vectors of these three statistical-data layers are separately connected to three recurrent layers, and the corresponding outputs are fed to a fully-connected layer that shares parameters across modalities and temporal inputs. Finally, the outputs of the fully-connected layer are fused by the temporal-adaptive-fusion layer and then combined by the multimodal-adaptive-fusion layer. To capture the correlations of both multiple modalities and temporal inputs, adaptive weights of modalities and temporal inputs are introduced into the loss functions for model training, and these weights are learned by an optimization algorithm. Extensive experiments are conducted on two challenging datasets and demonstrate that the proposed method achieves better performance than baseline and other state-of-the-art methods.
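
The pipeline described in the abstract can be approximated with a short PyTorch sketch. The code below is illustrative only and is not the authors' implementation: the class name AdaptiveFusionNet, the feature dimensions, the use of GRU recurrent layers, mean/std pooling as the statistical-data layer, and softmax-normalized learnable weights standing in for the adaptive temporal and modality weights are all assumptions made for the sketch.

# Illustrative sketch of the fusion pipeline described in the abstract.
# Assumptions (not from the paper): GRU recurrent layers, mean/std pooling as the
# "statistical-data" layer, and softmax-normalized learnable parameters standing in
# for the adaptive temporal/modality weights learned by the paper's optimization algorithm.
import torch
import torch.nn as nn


class AdaptiveFusionNet(nn.Module):
    def __init__(self, feat_dims=(2048, 2048, 128), hidden_dim=256, num_steps=10, num_classes=2):
        super().__init__()
        # One statistical-data projection and one recurrent layer per modality.
        self.stat_layers = nn.ModuleList([nn.Linear(2 * d, hidden_dim) for d in feat_dims])
        self.rnns = nn.ModuleList([nn.GRU(hidden_dim, hidden_dim, batch_first=True) for _ in feat_dims])
        # Fully-connected layer shared across modalities and temporal inputs.
        self.shared_fc = nn.Linear(hidden_dim, num_classes)
        # Learnable fusion weights, normalized by softmax at forward time.
        self.temporal_w = nn.Parameter(torch.zeros(num_steps))
        self.modal_w = nn.Parameter(torch.zeros(len(feat_dims)))

    def forward(self, inputs):
        # inputs: list of per-modality CNN features, each of shape (batch, steps, frames, feat_dim)
        per_modality = []
        for x, stat, rnn in zip(inputs, self.stat_layers, self.rnns):
            # Statistical-data layer: summarize the frames within each temporal segment.
            stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=-1)   # (batch, steps, 2*feat_dim)
            h, _ = rnn(stat(stats))                                    # (batch, steps, hidden_dim)
            logits = self.shared_fc(h)                                 # (batch, steps, num_classes)
            # Temporal-adaptive fusion over the segment (time-step) axis.
            t_w = torch.softmax(self.temporal_w, dim=0).view(1, -1, 1)
            per_modality.append((logits * t_w).sum(dim=1))             # (batch, num_classes)
        # Multimodal-adaptive fusion over the modality axis.
        m_w = torch.softmax(self.modal_w, dim=0)
        return sum(w * p for w, p in zip(m_w, per_modality))

In the paper, the adaptive weights are introduced into the loss functions and learned by a dedicated optimization algorithm; the softmax parameterization above is a simplification so that the sketch can be trained end-to-end with a standard optimizer.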
Published in: IEEE Transactions on Multimedia ( Volume: 22, Issue: 9, September 2020)
Page(s): 2454-2466
Date of Publication: 22 November 2019
