Abstract
Does the model demonstrate genuine proficiency in “item counting,” “color recognition,” or other Fundamental Visual Comprehension Capabilities (FVCCs)? The field of multimodal learning has seen remarkable advances: pretrained general Vision Language Models (VLMs) exhibit strong performance across a range of intricate Vision-Language (VL) tasks, and Multimodal Large Language Models (MLLMs) show emergent visual reasoning abilities from only a few examples. Yet models still tend to encounter difficulties when texts are supplemented with specific details expressed by simple visual phrases. Moreover, there is a scarcity of datasets with sufficient quantity, variety, and composability to evaluate each FVCC using statistical metrics. Accordingly, we decompose the complete VL task into 9M simple Visual Phrase Triplets (VPTs) across 16 categories, representing 16 distinct FVCCs, derived from structural scene graphs. We then reconstruct a Multilevel Scene Graph (MLSG) for each image and introduce an unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark consists of three exams and evaluates 8 widely used VLMs and 10 MLLMs. The results show the performance of each model across the 16 FVCC classes, as well as their lower and upper bounds under increased text complexity or noise-free image input. Finally, we improve MLLM performance and evoke their In-Context Learning behavior, without any tuning, by appending multiple VPT-generated QA pairs of the same type to the conversation history. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations of FVCCs.
Data Availability
The datasets used in this study are publicly available.
Funding
No funding was obtained for this study.
Author information
Contributions
Xie.P.: conceptualization, methodology, software, writing—original draft; Liu.B.: resources, supervision, writing—review and editing.
Ethics declarations
Ethics Approval
This work does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1. VPT Semantic Classification
In this section, we detail each semantic type. Visual Phrases are grouped into 16 categories according to their semantics, and the identification of each type of Visual Phrase corresponds to a fundamental visual comprehension capability of that class. The raw Visual Phrase Triplet data collected for each type are as follows (a minimal sketch of the dictionary-based classification is given after the list):
1. Scene Recognition identifies whether the image depicts an indoor or outdoor scene, e.g., (scene, is, indoor) and (scene, is, outdoor). It was classified by the object word with an indoor/outdoor dictionary.
2. Indoor Room Recognition recognizes the specific indoor location, e.g., (location, is, living room) and (it, is in, bathroom). It was classified by the object word with an indoor-location dictionary.
3. Outdoor Season Recognition distinguishes the outdoor season, e.g., (season, is, winter) and (it, is, spring). It was classified by the object word with a season dictionary.
4. Outdoor Weather Recognition distinguishes the outdoor weather, e.g., (it, is, rainy) and (weather, is, snowy). It was classified by the object word with a weather dictionary.
5. Object Counting states the count of the objects present, e.g., (people, accumulate, two) and (cats, accumulate, five). It was generated through the MLSG generation process in the global part.
6. Object Presence determines the existence of the primary object in the photo, e.g., (vehicles, are, present) and (bottle, is, present). It was generated from the subject word of the VPTs.
7. Object Compare shows the comparison relation of the subject-object pair, e.g., (brownies, are larger than, plate) and (tree in [248, 55, 64, 34], is taller than, tree in [245, 92, 26, 16]). If the subject and object share the same word in a comparison, we append their coordinates to prevent ambiguity. It was classified by the relation word with a comparison dictionary.
8. Object Color describes the color attribute of the subject word, e.g., (cat, is, black) and (eyes, are, blue). It was classified by the object word with a color dictionary.
9. Object Shape describes the shape attribute of the subject word, e.g., (hair, is, curly) and (speaker, is, round). It was classified by the object word with a shape dictionary.
10. Object Material distinguishes the material attribute of the subject word, e.g., (table, is made of, glass) and (floor, is made of, wood). It was classified by the object word with a material dictionary.
11. Object Other Attribute covers the remaining attributes beyond color, shape, and material, e.g., (door, is, open) and (box, is, empty). It was classified by the object word with the remaining attribute dictionary.
12. Possession Relation is collected from visual phrases such as “somebody’s something,” for example “boy’s dirty hands” and “girl’s long hair,” which are transformed into visual phrases with the relation word “is with,” e.g., (boy, is with, dirty hands) and (girl, is with, long hair). Note that this category is treated as a 2-hop VPT because phrases such as “boy with hands” and “girl with hair” mainly express common knowledge rather than visual comprehension; the object is therefore augmented with an attribute word such as “dirty” or “long” to test the visual comprehension capability.
13. Spatial Reasoning describes the positional relation between the subject and object, e.g., (man, is walking across, street) and (mountain, is behind, train). Note that some action visual phrases from “Human Activity” and “Other Movement” (described below) are also assigned to “Spatial Reasoning,” for example (man, walk toward, bench) and (cat, climb up, tree). It was classified by the relation word with a spatial-relation dictionary.
14. Human Activity distinguishes human daily actions, e.g., (children, play, table tennis) and (girl, eat, pizza). It was classified by the “H-O” label with a human dictionary and an object dictionary.
15. Other Movement describes other actions with nonhuman subject words, e.g., (birds, fly in, sky) and (signal lamp, stands on the side of, road). It was classified by the “Object” label and a relation-verb dictionary.
16. Human Sentiment recognizes human sentiment, e.g., (kids, are, happy) and (man, looks, sad). It was classified by the object word with a sentiment dictionary.
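The dictionary-based assignment above amounts to checking which predefined vocabulary the relation or object word of a triplet falls into. The following is a minimal sketch under that assumption; the dictionaries are illustrative stand-ins, not the full vocabularies used in our pipeline.

```python
# Illustrative, abbreviated dictionaries (the actual vocabularies are much larger).
COLOR_WORDS = {"black", "white", "red", "blue", "brown", "yellow"}
MATERIAL_WORDS = {"glass", "wood", "metal", "plastic", "leather"}
SEASON_WORDS = {"spring", "summer", "autumn", "winter"}
WEATHER_WORDS = {"rainy", "snowy", "sunny", "cloudy", "foggy"}
SPATIAL_RELATIONS = {"is on", "is behind", "is in front of",
                     "is on the left of", "is walking across"}
COMPARE_RELATIONS = {"is larger than", "is taller than", "is smaller than"}

def classify_vpt(subject: str, relation: str, obj: str) -> str:
    """Assign a (subject, relation, object) triplet to a semantic category by dictionary lookup."""
    if relation in COMPARE_RELATIONS:
        return "Object Compare"
    if relation in SPATIAL_RELATIONS:
        return "Spatial Reasoning"
    if obj in COLOR_WORDS:
        return "Object Color"
    if obj in MATERIAL_WORDS:
        return "Object Material"
    if obj in SEASON_WORDS:
        return "Outdoor Season Recognition"
    if obj in WEATHER_WORDS:
        return "Outdoor Weather Recognition"
    return "Object Other Attribute"  # fallback for remaining attribute triplets

print(classify_vpt("cat", "is", "black"))            # Object Color
print(classify_vpt("table", "is made of", "glass"))  # Object Material
```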
Appendix 2. VPT Basic Statistics
In this section, we present comprehensive statistics of the VPTs across four label dimensions: “Static/Dynamic,” “Global/Local,” “Entity Combination,” and “Semantic.” The proportions of the first three are shown in pie charts, and the distribution of “Semantic” is shown in a histogram.
Figure 6 shows the “Static/Dynamic” distribution: “Prepositive Attribute” accounts for the largest share at 39%, followed by “Postpositive Attribute” at 27% and normal “Dynamic” at 23%, while “Dynamic-Special” has the lowest share at 11%. Figure 7 illustrates the “Global/Local” ratio of 22:78. Figure 8 shows the proportions of “Entity Combination”: “O-A” accounts for the largest share at 43%, followed by “O-O” (26%), “H-O” (17%), “H-A” (10%), and “H-H” (4%).
The distribution of the 16 “Semantic” types is shown in Fig. 9. In this histogram, “Object Presence” is the most frequent type, followed by “Object Other Attribute” and “Spatial Reasoning.” The types “Object Counting,” “Human Activity,” and “Other Movement” account for nearly 8% of the whole data. The five least frequent types are “Outdoor Weather Recognition,” “Outdoor Season Recognition,” “Indoor Room Recognition,” “Object Compare,” and “Human Sentiment.”
Appendix 3. Transformation from Uni-hop VPT to Descriptive Statement
In this section, we illustrate the process of generating descriptive statements from sampled VPTs in the 16 categories. For each category, we generate descriptive statements using a randomly sampled template; each semantic type of VPT possesses its own template set. The details of the transformation are shown in Table 6. For each semantic category, we select one VPT as an exemplar, randomly sample two common templates from the corresponding template set, and then transform the VPT into descriptive statements using the different templates.
Taking the “Object Color” category as an example, we predefine a set of templates for it in advance, including “[subject] [relation] [object],” “there [relation] [object] [subject],” “[subject] [look/looks] in [object],” “[object] [subject] [relation] visible,” and so on. We then choose a corresponding VPT, such as (plate, is, white), and sample a template from the set, e.g., “[subject] [relation] [object]” or “there [relation] [object] [subject].” An article is prepended to the noun before the VPT is inserted into the template, resulting in text such as “the plate is white” or “there is a white plate.”
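The template filling can be sketched as a simple string substitution. The sketch below is an assumption-based illustration: the two templates and the naive article rule stand in for the actual template set and surface-form handling.

```python
import random

def article(word: str) -> str:
    """Naive indefinite-article choice for illustration purposes."""
    return "an" if word[0].lower() in "aeiou" else "a"

def vpt_to_statement(subject: str, relation: str, obj: str) -> str:
    """Turn a uni-hop 'Object Color' VPT into text using a randomly sampled template."""
    templates = [
        lambda s, r, o: f"the {s} {r} {o}",                 # "the plate is white"
        lambda s, r, o: f"there {r} {article(o)} {o} {s}",  # "there is a white plate"
    ]
    return random.choice(templates)(subject, relation, obj)

print(vpt_to_statement("plate", "is", "white"))
```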
Appendix 4. Compositional Descriptive Statement Engine
In this section, we present a comprehensive guide to the formulation of Compositional Descriptive Statements. The compositional texts are created automatically by our Compositional Descriptive Statement Engine (CDSE). CDSE generates both positive and negative texts from two inputs: a Multilevel Scene Graph (MLSG) and a number of hops.
Algorithms 1 and 2 detail the main process. Algorithm 1 describes the overall CDSE procedure for generating an n-hop VPT: it randomly samples n VPTs from the scene graph, and these VPTs form multiple connected components. Each connected component is transformed into descriptive text, and the component texts are finally joined with the conjunction “and.” Simple components consisting of 2 nodes and 1 edge, such as all VPTs in the global part or isolated VPTs in the local part, can be transformed by concatenating the three parts of the triplet. Complex components with more than 2 nodes or more than 1 edge are transformed by Algorithm 2.
In other words, to generate n-hop text, we sample n uni-hop VPTs from the MLSG as a sub-graph, partition the sub-graph into local and global parts, and then integrate the text generated from both.
For the global part, we transform each global VPT into text and concatenate the results with the conjunction “and.” Figure 12 illustrates an example of global-part text generation.
For the local part, we split it into several connected components, generate text for each connected component, and again concatenate the results with “and.” Figure 13 demonstrates an example of generating a local compositional descriptive statement from a single connected component.
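As a rough sketch of this top-level assembly, under the assumption that local VPTs can be grouped by graph connectivity and that the complex-component transformation (Algorithm 2) is passed in as a callable (component_to_text is a hypothetical name):

```python
from collections import defaultdict

def connected_components(vpts):
    """Group sampled (subject, relation, object) triplets into connected components."""
    adj = defaultdict(set)
    for s, _, o in vpts:
        adj[s].add(o)
        adj[o].add(s)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp_nodes = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp_nodes:
                continue
            comp_nodes.add(cur)
            stack.extend(adj[cur] - comp_nodes)
        seen |= comp_nodes
        components.append([t for t in vpts if t[0] in comp_nodes])
    return components

def generate_positive_text(global_vpts, local_vpts, component_to_text):
    """Join global VPT texts and local component texts with the conjunction 'and'."""
    parts = [f"{s} {r} {o}" for s, r, o in global_vpts]  # simple 2-node components
    parts += [component_to_text(comp)                    # Algorithm 2 for complex components
              for comp in connected_components(local_vpts)]
    return " and ".join(parts)
```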
To generate the corresponding negative text, we then take the same combination of local and global parts and resample one element of a selected VPT. The structure of the sub-graph remains unaltered; we selectively modify only one component of the triplet. If a node is selected, we resample it from the distribution of the same “Semantic” type (e.g., “red” in (apple, is, red) is resampled to “yellow” from the “color” attribute node distribution). If a relation edge is selected, we likewise resample it from the distribution of the same “Semantic” type (e.g., “is on the left of” in (tree, is on the left of, car) is resampled to “is in front of” from the “spatial reasoning” relation edge distribution). This arrangement ensures that positive and negative samples follow the same distribution within the same “Semantic” category, and it yields relatively harder negative samples, making the binary classification task more challenging.
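A hedged sketch of this negative-sampling step follows, assuming per-category value pools collected from the MLSG data; the pool contents and function name are illustrative.

```python
import random

# Illustrative per-category pools; in practice these are the node/edge
# distributions of the corresponding "Semantic" type in the MLSG data.
NODE_POOLS = {"Object Color": ["red", "yellow", "black", "white", "blue"]}
EDGE_POOLS = {"Spatial Reasoning": ["is on the left of", "is in front of", "is behind"]}

def make_negative(vpt, category, perturb_edge=False):
    """Resample one element of a VPT within the same semantic-type distribution."""
    s, r, o = vpt
    if perturb_edge:
        candidates = [e for e in EDGE_POOLS[category] if e != r]
        return (s, random.choice(candidates), o)
    candidates = [n for n in NODE_POOLS[category] if n != o]
    return (s, r, random.choice(candidates))

print(make_negative(("apple", "is", "red"), "Object Color"))  # e.g., (apple, is, yellow)
print(make_negative(("tree", "is on the left of", "car"),
                    "Spatial Reasoning", perturb_edge=True))  # e.g., (tree, is behind, car)
```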
Text generation from a single connected component is detailed in Algorithm 2. It splits the complex component into two trees at a randomly selected main relation edge (e.g., “is holding,” shown in red in Fig. 13, with the nodes “boy” and “bottle,” also in red, serving as the root nodes of the two split trees). The two trees are then traversed by depth-first search from their root nodes. During the traversal, nodes and edges are aggregated in order as prepositive and postpositive attributes of the root node, guided by the tagged edge feature “Static.” Figure 14 shows how the root node “cat” yields “black sleeping cat on the table.” Finally, we concatenate the compositional subject and object from the two trees with the main relation word to obtain the descriptive text of the complex component.
During this process, we randomly sample a “dynamic” VPT as the main VPT if one exists; otherwise, we select another VPT instead. We then aggregate the “static” attribute nodes to reduce complexity and assemble the subject and object parts of the main VPT through depth-first search (DFS) traversal. Next, we combine the compositional subject and object nodes and generate the corresponding text description in the same way as for a uni-hop VPT. In this manner, the multi-hop VPT is obtained and transformed into text.
“Static” node aggregation. We aggregate the “static” attribute VPTs to reduce complexity. The “static” attribute VPTs are those labeled under “Static/Dynamic” as “prepositive attribute,” “postpositive attribute,” or “special dynamic.” We place “special dynamic” and “prepositive attribute” in front of the noun word and append “postpositive attribute” at the rear. Note that “special dynamic” has higher priority than “prepositive attribute” and is placed closer to the noun word. As shown in Fig. 14, we aggregate the “prepositive attribute” VPT (cat, is, black), the “special dynamic” VPT (cat, is, sleeping), and the “postpositive attribute” VPT (cat, is on, table) in order and generate the compositional node “black sleeping cat on table.”
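The ordering rule can be sketched as follows; the label strings and the (label, attribute_text) representation are assumptions made for illustration.

```python
def aggregate_static(noun, static_vpts):
    """
    Build a compositional node from a root noun and its "static" attribute VPTs.
    Each VPT is given as (label, attribute_text), where label is one of
    "prepositive attribute", "special dynamic", or "postpositive attribute".
    """
    pre  = [a for lbl, a in static_vpts if lbl == "prepositive attribute"]
    spec = [a for lbl, a in static_vpts if lbl == "special dynamic"]
    post = [a for lbl, a in static_vpts if lbl == "postpositive attribute"]
    # "special dynamic" sits closest to the noun; "postpositive attribute" trails it.
    return " ".join(pre + spec + [noun] + post)

print(aggregate_static("cat", [("prepositive attribute", "black"),
                               ("special dynamic", "sleeping"),
                               ("postpositive attribute", "on table")]))
# -> "black sleeping cat on table"
```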
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, P., Liu, B. Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data. Cogn Comput 16, 3484–3504 (2024). https://doi.org/10.1007/s12559-024-10351-8