Design and development of counting-based visual question answering model using heuristic-based feature selection with deep learning

Artificial Intelligence Review

Abstract

Visual Question Answering (VQA) is a significant research area that combines computer vision and natural language processing techniques. Among all question types, counting questions such as “How many?” are considered the most challenging, and VQA models still have difficulty counting the objects present in natural images. Basic VQA techniques either classify the answer from a fixed-length representation of both the question and the image, or sum fractional counts estimated from each image segment. The soft attention used in these methods is identified as the primary cause of this difficulty. To circumvent this problem, this paper implements a new visual question-answering system for the counting scenario. First, standard benchmark datasets for visual question answering are gathered. These datasets contain both images and questions, so feature extraction is applied to both. For the questions, text pre-processing is first performed through punctuation removal, stemming, and stop word removal, and word2vec features are then extracted. Similarly, deep features of the images are extracted from the pooling layer of a Deep Convolutional Neural Network (DCNN). The two sets of features are integrated and passed to an optimal feature selection step, which retains the most significant features carrying unique information. Optimal feature selection is handled by the Parameter Improved-Elephant Herding Optimization (PI-EHO) algorithm. EHO needs little time and computational effort, can be applied to a broad range of engineering optimization problems, and can also tackle multilevel thresholding problems. These advantages over conventional optimization algorithms motivate the choice of EHO in the designed method. Finally, answers are generated by a hybrid deep learning model, the Optimized Deep Neural-Long Short-Term Memory (DN-LSTM), which combines a Long Short-Term Memory (LSTM) network with a Deep Neural Network (DNN) and whose architecture is tuned by the proposed PI-EHO. The designed method is evaluated on different datasets and yields promising results compared with existing methods.
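To illustrate how an elephant-herding heuristic can drive feature selection, the following is a minimal sketch of standard EHO (clan updating, matriarch update, and separating operators) applied to a binary feature mask. It is not the paper's PI-EHO, whose parameter improvements are not described in the abstract. The function name `eho_select`, the 0.5 threshold that turns a position into a keep/drop mask, the elitism step, the `RELEVANCE` scores, and `toy_fitness` are all assumptions made for illustration. In practice the fitness would more plausibly be the validation accuracy of the answer model on the selected features.

```python
import random

def eho_select(fitness, dim, clans=3, per_clan=5, iters=40,
               alpha=0.5, beta=0.1, seed=0):
    """Return (mask, score): the best binary feature mask found and its fitness."""
    rng = random.Random(seed)

    def to_mask(x):
        # Assumption: a feature is kept when its coordinate exceeds 0.5.
        return [v > 0.5 for v in x]

    def score(x):
        return fitness(to_mask(x))

    def clip(v):
        return min(1.0, max(0.0, v))

    # Each clan is a list of elephant positions in [0, 1]^dim.
    herd = [[[rng.random() for _ in range(dim)] for _ in range(per_clan)]
            for _ in range(clans)]
    best = max((x for clan in herd for x in clan), key=score)[:]

    for _ in range(iters):
        for clan in herd:
            clan.sort(key=score, reverse=True)   # clan[0] is the matriarch
            matriarch = clan[0][:]
            center = [sum(x[d] for x in clan) / per_clan for d in range(dim)]
            # Clan updating: every other elephant moves toward the matriarch.
            for x in clan[1:]:
                for d in range(dim):
                    step = alpha * (matriarch[d] - x[d]) * rng.random()
                    x[d] = clip(x[d] + step)
            # Matriarch update: move to a scaled copy of the clan centre.
            clan[0] = [clip(beta * c) for c in center]
            # Separating: the worst elephant leaves and is re-drawn at random.
            clan.sort(key=score, reverse=True)
            clan[-1] = [rng.random() for _ in range(dim)]
            # Elitism (added for this sketch): remember the best position seen.
            cand = max(clan, key=score)
            if score(cand) > score(best):
                best = cand[:]
    return to_mask(best), score(best)

# Hypothetical per-feature relevance scores, for demonstration only.
RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.7, 0.0]

def toy_fitness(mask):
    # Reward relevant features and charge a fixed cost of 0.3 per kept feature.
    return sum(r - 0.3 for r, keep in zip(RELEVANCE, mask) if keep)

mask, score = eho_select(toy_fitness, dim=len(RELEVANCE))
print(mask, round(score, 3))
```

In this sketch the mask would be applied to the integrated question and image feature vector before it is passed to the answer-generation network. The same search loop could, under similar assumptions, be reused to tune architecture parameters such as hidden-layer sizes.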

Data availability

The data underlying this article are available in the visual question-answering database, at https://visualqa.org/download.html.

Author information

Corresponding author

Correspondence to Lejian Liao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Welde, T.M., Liao, L. Design and development of counting-based visual question answering model using heuristic-based feature selection with deep learning. Artif Intell Rev 56, 8859–8888 (2023). https://doi.org/10.1007/s10462-022-10385-0

  • DOI: https://doi.org/10.1007/s10462-022-10385-0