
Visual Question Answering in Bangla to assist individuals with visual impairments extract information about objects and their spatial relationships in images

Published: 20 August 2023 · DOI: 10.1145/3605423.3605427

ABSTRACT

With millions of speakers worldwide, Bangla is one of the most widely spoken languages in the world, and a large number of people rely on it as their primary medium of communication. Many of these individuals have visual impairments, including but not limited to central vision loss, peripheral vision loss, and blurry vision. This poses several challenges, one of them being extracting information from images. While techniques such as image captioning have been proposed to address this issue, visual question answering (VQA) offers a far more in-depth and robust way to understand an image. However, there has been a paucity of research into developing a VQA system in Bangla to assist individuals with visual impairments. To reduce this gap, we have introduced a VQA system in Bangla designed to assist visually impaired individuals. VQA is a multifaceted problem, and in this paper we focus on finding the spatial relationships between objects. We have broken this problem down into three sub-tasks: object detection, object counting, and, finally, relative positioning of the detected objects. The system takes in a question from the user, determines which sub-task to perform, and then returns the answer. We have leveraged several pre-trained models, namely Bangla-BERT, EfficientDet-D7, InceptionResNetV2, and MiDaS v2.1. The major contributions of this paper are a procedurally generated dataset for training models to identify which action to perform based on the user's prompt, and the use of image segmentation to identify the relative spatial positions between objects in all three spatial dimensions.
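
The abstract outlines a three-stage pipeline, and the sketches below illustrate each stage under stated assumptions. The first stage, deciding which sub-task a user's question maps to, can be framed as sequence classification on top of Bangla-BERT. This is a minimal sketch, assuming the Hugging Face checkpoint csebuetnlp/banglabert as a stand-in for the paper's Bangla-BERT model; the three intent labels are hypothetical, and the classifier would have to be fine-tuned on the procedurally generated prompt dataset the abstract describes.

```python
# Sketch: routing a Bangla question to one of three sub-tasks with Bangla-BERT.
# Assumptions: "csebuetnlp/banglabert" stands in for the paper's Bangla-BERT
# checkpoint, and the label set below is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["object_detection", "object_counting", "relative_positioning"]

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=len(LABELS)
)
# The classification head is freshly initialised here; it must be fine-tuned
# on a question/intent dataset before the predictions are meaningful.

def classify_intent(question: str) -> str:
    """Map a Bangla question to the sub-task the system should run."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# e.g. classify_intent("ছবিতে কয়টি চেয়ার আছে?")  # "How many chairs are in the picture?"
```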
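The second stage covers the object detection and counting sub-tasks. A sketch follows, assuming the TensorFlow Hub release of EfficientDet-D7; the paper does not state which distribution of the model it loaded, and the 0.5 score threshold is an illustrative choice, not one taken from the paper.

```python
# Sketch: object detection and counting with a pre-trained EfficientDet-D7.
# Assumption: the TensorFlow Hub SavedModel is used in place of whatever
# EfficientDet-D7 build the authors actually loaded.
import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/efficientdet/d7/1")

def detect_objects(image_path: str, score_threshold: float = 0.5):
    """Return (boxes, class_ids, scores) for detections above the threshold."""
    image = tf.io.decode_image(tf.io.read_file(image_path), channels=3)
    result = detector(image[tf.newaxis, ...])      # batch of one uint8 image
    boxes = result["detection_boxes"][0].numpy()   # (ymin, xmin, ymax, xmax), normalised
    scores = result["detection_scores"][0].numpy()
    classes = result["detection_classes"][0].numpy().astype(int)  # COCO class ids
    keep = scores >= score_threshold
    return boxes[keep], classes[keep], scores[keep]

def count_objects(image_path: str, coco_class_id: int) -> int:
    """Counting reduces to tallying confident detections of one class."""
    _, classes, _ = detect_objects(image_path)
    return int((classes == coco_class_id).sum())
```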
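The third stage derives relative position in all three spatial dimensions. The paper combines image segmentations with MiDaS depth; the sketch below is a simplification that uses raw bounding boxes instead of segmentation masks, and its comparison rules and margin are illustrative rather than the paper's. Note that MiDaS predicts relative inverse depth, so larger values mean closer to the camera.

```python
# Sketch: relative spatial position of object A with respect to object B,
# combining 2D box geometry with a MiDaS-style depth map. The margin and the
# box-based (rather than mask-based) comparison are assumptions.
import numpy as np

def relative_position(box_a, box_b, depth_map, margin: float = 0.05):
    """Boxes are (ymin, xmin, ymax, xmax) in pixels; depth_map is an HxW array."""
    def center(box):
        y0, x0, y1, x1 = box
        return (y0 + y1) / 2.0, (x0 + x1) / 2.0

    def median_inverse_depth(box):
        y0, x0, y1, x1 = (int(round(v)) for v in box)
        return float(np.median(depth_map[y0:y1, x0:x1]))

    h, w = depth_map.shape
    (ay, ax), (by, bx) = center(box_a), center(box_b)
    terms = []
    # Horizontal and vertical relations from box centres, with a small margin
    # so near-ties are not reported as a direction.
    if ax < bx - margin * w:
        terms.append("left of")
    elif ax > bx + margin * w:
        terms.append("right of")
    if ay < by - margin * h:
        terms.append("above")
    elif ay > by + margin * h:
        terms.append("below")
    # Depth relation: larger inverse depth means closer to the camera.
    da, db = median_inverse_depth(box_a), median_inverse_depth(box_b)
    if da > db * (1 + margin):
        terms.append("in front of")
    elif da < db * (1 - margin):
        terms.append("behind")
    return terms or ["at roughly the same position as"]
```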

References

  1. Shafi Ahmed, Md Humaion Kabir Mehedi, Moh Rahman, and Jawad Bin Sayed. 2022. Bangla Music Lyrics Classification. 142–147. https://doi.org/10.1145/3543712.3543752
  2. Abhik Bhattacharjee, Tahmid Hasan, Wasi Ahmad, Kazi Samin Mubasshir, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, and Rifat Shahriyar. 2022. BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. In Findings of the Association for Computational Linguistics: NAACL 2022. https://doi.org/10.18653/v1/2022.findings-naacl.98
  3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
  4. Brendan Dineen, Rupert Bourne, S Ali, D Huq, and G Johnson. 2003. Prevalence and causes of blindness and visual impairment in Bangladeshi adults: Results of the National Blindness and Low Vision Survey of Bangladesh. The British Journal of Ophthalmology 87, 7 (2003), 820–828. https://doi.org/10.1136/bjo.87.7.820
  5. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 457–468. https://doi.org/10.18653/v1/D16-1044
  6. Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven C. H. Hoi, and Xiaogang Wang. 2018. Question-Guided Hybrid Convolution for Visual Question Answering. In Computer Vision – ECCV 2018, Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer International Publishing, Cham, 485–501.
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  8. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  9. Kushal Kafle and Christopher Kanan. 2017. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163 (2017), 3–20. https://doi.org/10.1016/j.cviu.2017.06.005
  10. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision. Springer, 740–755.
  11. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS '16). Curran Associates Inc., Red Hook, NY, USA, 289–297.
  12. Rakin Mostafa, Md Humaion Kabir Mehedi, Md Alam, and Annajiat Alim Rasel. 2022. Bidirectional LSTM and NLP based Sentiment Analysis of Tweets.
  13. René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. 2022. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 3 (2022).
  14. Mike Schuster and Kuldip Paliwal. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45 (1997), 2673–2681. https://doi.org/10.1109/78.650093
  15. Safwan Shaheer, Ishmam Hossain, Sudipta Sarna, Md Humaion Kabir Mehedi, and Annajiat Alim Rasel. 2023. Evaluating Question Generation Models using QA Systems and Semantic Textual Similarity.
  16. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  17. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
  18. Mingxing Tan, Ruoming Pang, and Quoc V. Le. 2020. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10781–10790.

Published in

ICCTA '23: Proceedings of the 2023 9th International Conference on Computer Technology Applications
May 2023, 270 pages
ISBN: 9781450399579
DOI: 10.1145/3605423

Copyright © 2023 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Qualifiers

              • research-article
              • Research
              • Refereed limited