End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts


Abstract:

Multimodal named entity recognition (MNER) for social media aims to detect named entities in user-generated posts with the aid of visual information from attached images. Existing methods use pretrained visual models or visual grounding (VG) toolkits to learn visual information. However, they still suffer from a mismatch issue: the visual features extracted by the visual encoder are inconsistent with what cross-modal interaction actually requires. Ideally, the visual encoder should actively extract visual information guided by the text, which inherently provides the blueprint of the desired visual features. In this article, we present an end-to-end VG framework for the MNER task (VG-MNER), which adaptively learns text-related visual features. Specifically, we introduce a backbone network with a feature fusion module to learn and aggregate multisize visual representations. We then develop a text-related visual attention mechanism to refine the visual features. Notably, an entity-image contrast loss is designed to guide the training of the visual encoder. The proposed model outperforms several state-of-the-art methods, achieving F1 scores of 75.62% and 88.11% on two benchmark datasets. Experimental results demonstrate the effectiveness of leveraging text-related visual information in the MNER task.
Published in: IEEE Transactions on Computational Social Systems (Volume: 11, Issue: 6, December 2024)
Page(s): 7223 - 7233
Date of Publication: 12 June 2024
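
The abstract names two mechanisms without giving formulas: a text-related visual attention and an entity-image contrast loss. The sketch below illustrates the general idea only, and it is not the VG-MNER formulation. It assumes scaled dot-product attention for the text-guided weighting and an InfoNCE-style loss with cosine similarity for the contrast term. The function names, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def text_guided_attention(text_vec, region_vecs):
    # Assumption: scaled dot-product attention. Each image-region vector is
    # weighted by its similarity to the text vector, then the regions are pooled.
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, r) / scale for r in region_vecs])
    dim = len(region_vecs[0])
    pooled = [sum(w * r[i] for w, r in zip(weights, region_vecs)) for i in range(dim)]
    return weights, pooled

def cosine(u, v):
    # Assumes both vectors are non-zero.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def contrastive_loss(entity_vecs, image_vecs, temperature=0.1):
    # Assumption: InfoNCE-style loss. Entity i should be most similar to
    # image i, with the other images in the batch acting as negatives.
    losses = []
    for i, e in enumerate(entity_vecs):
        probs = softmax([cosine(e, v) / temperature for v in image_vecs])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)

# Toy usage: the first region aligns with the text, so it gets the larger weight.
text = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = text_guided_attention(text, regions)

# Matched entity-image pairs give a lower loss than swapped pairs.
entities = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(entities, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = contrastive_loss(entities, [[0.0, 1.0], [1.0, 0.0]])
```

In a setup like this, the gradient of the contrast term would flow back into whatever produces the image vectors, which is one plausible reading of how such a loss could "guide the training of the visual encoder"; the paper itself should be consulted for the actual design.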
