Abstract:
Vision-language (VL) models typically rely on an object detector to extract features for the visual representation. Feature extractors built on object detectors tend to suffer from ambiguity in the extracted visual embeddings caused by overlapping (over-sampled) image regions. This in turn causes semantic misalignment between the corresponding visual and language embeddings in VL tasks. In this paper, we propose a simple approach to disambiguate the visual embeddings through segmentation and alignment within a word embedding space. In our experiments, we assess the impact of instance segmentation on improving the visual representations. We visualize bounding-box and segmentation-based feature vectors and show that the segmentation features contribute towards disambiguating image region features. We also investigate an alignment process that projects image region features into a word embedding space, training the projection network using the corresponding object category labels. For the object categories assessed, we find that projecting image region features onto the word embedding space significantly reduces the cosine similarity between each category and all other categories. This shows that projection onto the word embedding space is an effective approach to aligning the visual and language embeddings, which in turn helps disambiguate the image region features for downstream VL tasks.
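The core idea of the alignment step, projecting region features into a word embedding space and measuring cosine similarity against category labels, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature and word-embedding dimensions (2048 and 300), the random vectors standing in for detector features and label embeddings, and the simple L2 alignment loss are all assumptions made for the sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d_vis, d_word = 2048, 300  # assumed visual / word embedding dims

# Stand-ins for word embeddings of two object category labels.
word_emb = {"dog": rng.normal(size=d_word), "cat": rng.normal(size=d_word)}

# Linear projection from visual space to word embedding space.
W = rng.normal(scale=0.01, size=(d_word, d_vis))

def project(region_feat):
    return W @ region_feat

# Train the projection to pull a region's feature toward its label's
# word embedding (here via a few gradient steps on an L2 loss,
# a stand-in for the paper's training objective).
region = rng.normal(size=d_vis)  # stand-in for a detected region's feature
label = "dog"
lr = 1e-4
for _ in range(200):
    p = project(region)
    # Gradient of 0.5 * ||p - target||^2 with respect to W.
    grad = np.outer(p - word_emb[label], region)
    W -= lr * grad

sim_same = cosine_similarity(project(region), word_emb["dog"])
sim_other = cosine_similarity(project(region), word_emb["cat"])
print(sim_same, sim_other)
```

After training, the projected region feature is close to its own category's word embedding (cosine similarity near 1) while remaining nearly orthogonal to the other category, mirroring the separation the paper reports between categories in the word embedding space.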
Published in: 2023 IEEE AFRICON
Date of Conference: 20-22 September 2023
Date Added to IEEE Xplore: 31 October 2023