Abstract:
Empowering machines to understand our physical world should go beyond models trained only on natural language and models trained only on vision. Vision-and-language is a growing field of study that attempts to bridge the gap between the natural language processing and computer vision communities by enabling models to learn visually grounded language. However, because most pre-trained visual-linguistic models focus on the alignment between visual regions and natural language, it is difficult to claim that these models capture certain properties of objects, such as size, in their latent space. Inspired by recent trends in prompt learning, this study designed a prompt learning framework for two visual-linguistic models, ViLBERT and ViLT, and used manually crafted prompt templates to evaluate how consistently these models compare the sizes of objects.
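A minimal sketch of this style of template-based probing, assuming the publicly available Hugging Face `transformers` ViLT masked-language-modeling checkpoint (`dandelin/vilt-b32-mlm`). The template wording, object names, candidate fillers, and scoring rule below are illustrative assumptions, not the templates or evaluation protocol used in the study:

```python
# Hypothetical size-comparison probe: fill a hand-written template with an
# object pair, mask the comparative word, and compare the model's logits for
# two candidate fillers at the [MASK] position.
import requests
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForMaskedLM

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

# Manually crafted prompt template; objects here are placeholder examples.
template = "the {obj1} is [MASK] than the {obj2}."
text = template.format(obj1="elephant", obj2="cat")

# Any paired image works for illustration; this is a standard COCO sample.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # shape: [1, text_seq_len, vocab_size]

# Locate the [MASK] token and score each candidate comparative word there.
mask_pos = (encoding.input_ids == processor.tokenizer.mask_token_id).nonzero()[0, 1]
for word in ("bigger", "smaller"):
    token_id = processor.tokenizer.convert_tokens_to_ids(word)
    print(word, logits[0, mask_pos, token_id].item())
```

Under this kind of setup, consistency can be checked by repeating the probe over many object pairs and paraphrased templates and measuring how often the higher-scored filler agrees with real-world size relations.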
Published in: 2022 IEEE 21st International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)
Date of Conference: 08-10 December 2022
Date Added to IEEE Xplore: 21 April 2023