
VTN-EG: CLIP-Based Visual and Textual Fusion Network for Entity Grounding


Abstract:

In the field of space science and utilization, we have constructed a domain knowledge graph from text data. A significant limitation of this knowledge graph, however, is its lack of domain-relevant multimodal data. Entity grounding, which aligns external images with their corresponding text entities, is the principal approach to addressing this issue. Current entity grounding faces two challenges: entity name ambiguity and entity image sparsity. To address them, we propose a method called VTN-EG, in which we design three new modules: a prompt network module, an attention module, and a decision module. The first two modules use a prompt network and a cross-attention mechanism, respectively, to enrich the expressiveness of the image and text encodings. The decision module produces the final result using a strategy that applies the prompt network during visual discrimination but not during text discrimination. To support algorithmic research, we construct a specialized entity grounding dataset, the Chinese Space Science and Utilization Entity Grounding (SSUEG) dataset, and select four deep learning methods for comparison. Extensive experiments on SSUEG and two public datasets demonstrate the effectiveness and generalizability of VTN-EG. Finally, ablation experiments confirm the effectiveness and necessity of our algorithmic improvements.
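The abstract describes fusing CLIP-style image and text encodings with learnable prompts and cross-attention. The following is a minimal PyTorch sketch of that general idea only; the module names, dimensions, and pooling choices are assumptions for illustration, not the authors' actual VTN-EG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of prompt + cross-attention fusion over
    CLIP-style embeddings. Learnable prompt vectors are prepended to the
    text tokens (a simplified stand-in for a prompt network), then image
    patch features attend to the prompted text sequence."""

    def __init__(self, dim: int = 512, n_prompts: int = 4, n_heads: int = 8):
        super().__init__()
        # Learnable prompt vectors (assumed form of the prompt network module)
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # Cross-attention: image features as queries, text tokens as keys/values
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, P, D) image patch embeddings
        # txt_feats: (B, T, D) text token embeddings
        batch = txt_feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)  # (B, n_prompts, D)
        prompted_txt = torch.cat([prompts, txt_feats], dim=1)      # (B, n_prompts+T, D)
        fused, _ = self.cross_attn(img_feats, prompted_txt, prompted_txt)
        # Mean-pool each modality and score image-entity alignment by cosine similarity
        img_vec = fused.mean(dim=1)
        txt_vec = prompted_txt.mean(dim=1)
        return F.cosine_similarity(img_vec, txt_vec, dim=-1)      # (B,)
```

A grounding decision could then threshold these similarity scores to accept or reject a candidate image for a text entity.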
Date of Conference: 18-20 October 2024
Date Added to IEEE Xplore: 20 December 2024
Conference Location: Chengdu, China

