Abstract
Pedestrian attribute recognition (PAR) supports public safety and security. By automatically detecting attributes such as clothing color, accessories, and hairstyles, surveillance systems can provide valuable information for criminal investigations and help identify suspects by their appearance. In crowd management scenarios, PAR also enables monitoring of specific groups, such as individuals wearing safety gear at construction sites, and the identification of potential threats in sensitive areas. Real-time attribute recognition enhances situational awareness and facilitates rapid response during emergencies, thereby contributing to the overall safety and security of public spaces. This work proposes applying the BLIP-2 Visual Question Answering (VQA) framework to the PAR problem. By employing Large Language Models (LLMs), we achieve an accuracy of 92% on the private set. Combining VQA and LLMs makes it possible to analyze visual information effectively and to answer questions about pedestrian attributes, improving the accuracy and performance of PAR systems.
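The idea of treating PAR as question answering can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names, question wordings, answer vocabularies, and every function name below are hypothetical, and the VQA model is abstracted as a callable (`answer_fn`) so that the example runs without model weights. The "Question: ... Answer:" prompt format is assumed here as one common way to query BLIP-2 style models.

```python
from typing import Callable, Dict, List

# Hypothetical attribute questions; the abstract does not list the actual prompts.
ATTRIBUTE_QUESTIONS: Dict[str, str] = {
    "upper_color": "What color is the person's upper clothing?",
    "has_bag": "Is the person carrying a bag?",
    "has_hat": "Is the person wearing a hat?",
}

# Hypothetical closed label sets used to map free-text answers to classes.
ALLOWED_ANSWERS: Dict[str, List[str]] = {
    "upper_color": ["black", "blue", "red", "white"],
    "has_bag": ["yes", "no"],
    "has_hat": ["yes", "no"],
}


def build_prompt(question: str) -> str:
    """Wrap a question in a 'Question: ... Answer:' prompt (assumed format)."""
    return f"Question: {question} Answer:"


def normalize_answer(attribute: str, raw: str) -> str:
    """Return the first allowed label that appears as a word in the answer."""
    words = raw.lower().replace(".", " ").replace(",", " ").split()
    for label in ALLOWED_ANSWERS[attribute]:
        if label in words:
            return label
    return "unknown"


def recognize_attributes(
    image: object, answer_fn: Callable[[object, str], str]
) -> Dict[str, str]:
    """Ask one question per attribute and collect the normalized labels."""
    results: Dict[str, str] = {}
    for attribute, question in ATTRIBUTE_QUESTIONS.items():
        raw = answer_fn(image, build_prompt(question))
        results[attribute] = normalize_answer(attribute, raw)
    return results


def fake_answer_fn(image: object, prompt: str) -> str:
    """Stand-in for a real VQA model; returns canned answers for the demo."""
    if "color" in prompt:
        return "The shirt is blue."
    if "bag" in prompt:
        return "yes"
    return "maybe"


if __name__ == "__main__":
    print(recognize_attributes(None, fake_answer_fn))
```

In practice, `answer_fn` would wrap a pretrained VQA model that takes a pedestrian crop and a prompt and returns generated text. Mapping free-text answers onto a closed label set, as `normalize_answer` does, is one simple way to turn such answers into attribute predictions that can be scored for accuracy.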
This work is partially funded by the Spanish Ministry of Science and Innovation under projects PID2021-122402OB-C22 and TED2021-131019B-10, and by the ACIISI-Gobierno de Canarias and European FEDER funds under projects ProID2021010012, ULPGC Facilities Net, and Grant EIS 2021 04.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Castrillón-Santana, M., Sánchez-Nielsen, E., Freire-Obregón, D., Santana, O.J., Hernández-Sosa, D., Lorenzo-Navarro, J. (2023). Evaluation of a Visual Question Answering Architecture for Pedestrian Attribute Recognition. In: Tsapatsoulis, N., et al. Computer Analysis of Images and Patterns. CAIP 2023. Lecture Notes in Computer Science, vol 14184. Springer, Cham. https://doi.org/10.1007/978-3-031-44237-7_2
DOI: https://doi.org/10.1007/978-3-031-44237-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44236-0
Online ISBN: 978-3-031-44237-7
eBook Packages: Computer Science, Computer Science (R0)