Abstract
Visual Question Answering (VQA) is a prominent task that combines computer vision and natural language processing techniques. Among all question types, counting questions such as "How many?" are widely regarded as the most challenging, and VQA models still have difficulty counting the objects present in natural images. Basic VQA techniques either classify answers from a fixed-length representation of the question and image or estimate counts by summing fractional counts over image segments, and the soft attention used in these methods is identified as a primary source of these difficulties. To circumvent this problem, this paper implements a new visual question answering system based on a counting scenario. First, standard benchmark datasets for visual question answering, each incorporating both images and questions, are gathered, and feature extraction is applied to both modalities. For the questions, text pre-processing is first performed through punctuation removal, stemming, and stop-word removal, after which word2vec features are extracted. Similarly, deep features of the given images are extracted from the pooling layer of a Deep Convolutional Neural Network (DCNN). The two feature sets are integrated and fed to an optimal feature selection procedure that retains the most significant features carrying unique information. This selection is handled by the proposed Parameter Improved-Elephant Herding Optimization (PI-EHO) algorithm: EHO requires little time and low computational complexity, can be applied to a wide range of engineering optimization problems, and can also tackle multilevel thresholding problems, and these advantages over conventional optimization algorithms motivate its choice in the designed method. Finally, answer generation is performed by a hybrid deep learning model combining a Deep Neural Network (DNN) and Long Short-Term Memory (LSTM), termed DN-LSTM, whose architecture is optimized by the proposed PI-EHO. The designed method is evaluated on different datasets and yields promising results compared with existing methods.
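To make the described pipeline concrete, the following is a minimal PyTorch sketch of the flow, not the authors' implementation: all class names, layer sizes, the toy stop-word list and suffix stemmer, the nn.Embedding stand-in for word2vec, and the random binary mask standing in for the PI-EHO-selected feature subset are illustrative assumptions.

```python
import re

import torch
import torch.nn as nn

# Toy stop-word list and suffix "stemmer"; the actual pre-processing would
# use a full NLP toolkit for stemming and stop-word removal.
STOPWORDS = {"how", "many", "are", "is", "the", "in", "on", "a", "an", "of"}

def preprocess(question: str) -> list:
    """Punctuation removal, stop-word removal, and a crude stand-in stemmer."""
    tokens = re.sub(r"[^\w\s]", "", question.lower()).split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]

class ImageBranch(nn.Module):
    """Toy DCNN; deep features are read off its pooling layer."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # pooling-layer features

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(img)).flatten(1)  # (batch, dim)

class AnswerHead(nn.Module):
    """Hybrid DN-LSTM head: LSTM over word vectors, DNN over fused features."""
    def __init__(self, word_dim: int = 64, img_dim: int = 128, n_answers: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, 64, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(64 + img_dim, 128), nn.ReLU(),
            nn.Linear(128, n_answers),
        )

    def forward(self, word_vecs, img_feat, feature_mask):
        _, (h, _) = self.lstm(word_vecs)          # h[-1]: (batch, 64)
        fused = torch.cat([h[-1], img_feat], 1)   # integrated feature vector
        return self.dnn(fused * feature_mask)     # masked = "selected" features

# --- usage ---
tokens = preprocess("How many dogs are in the picture?")  # ['dog', 'picture']
vocab = {t: i for i, t in enumerate(tokens)}
embed = nn.Embedding(len(vocab), 64)                      # stand-in for word2vec
word_vecs = embed(torch.tensor([[vocab[t] for t in tokens]]))

img_feat = ImageBranch()(torch.randn(1, 3, 64, 64))
# Random binary mask: a placeholder for the subset PI-EHO would select.
feature_mask = (torch.rand(64 + 128) > 0.3).float()
logits = AnswerHead()(word_vecs, img_feat, feature_mask)  # (1, 16) answer scores
```

In the paper's method, the binary mask would be produced by PI-EHO searching for the feature subset that maximizes answer accuracy, and the same optimizer would also tune the DN-LSTM architecture; the random mask above only illustrates where that selection plugs into the pipeline.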
Data availability
The data underlying this article are available in the Visual Question Answering database at https://visualqa.org/download.html.
About this article
Cite this article
Welde, T.M., Liao, L. Design and development of counting-based visual question answering model using heuristic-based feature selection with deep learning. Artif Intell Rev 56, 8859–8888 (2023). https://doi.org/10.1007/s10462-022-10385-0