ABSTRACT
Bangla is one of the most widely spoken languages in the world, and a large number of people rely on it as their primary medium of communication. Among them, many individuals have visual impairments, including but not limited to central vision loss, peripheral vision loss, and blurry vision. These impairments pose several challenges, one of which is extracting information from images. While techniques such as image captioning have been proposed to address this issue, visual question answering (VQA) offers a more in-depth and robust way to understand an image. However, there has been a paucity of research into developing a Bangla VQA system to assist individuals with visual impairments. To reduce this gap, we introduce a VQA system in Bangla designed to assist visually impaired individuals. VQA is a multifaceted problem, and in this paper we focus on finding the spatial relationships between objects. We break this problem down into three sub-tasks: object detection, object counting, and relative positioning of the detected objects. The system takes in a question from the user, determines which sub-task to perform, and returns the answer. We leverage several pre-trained models, including Bangla-BERT, EfficientDet-D7, InceptionResNetV2, and MiDaS v2.1. The major contributions of this paper are a procedurally generated dataset for training models to identify which action to perform based on the user's prompt, and the use of image segmentation to identify the relative spatial position between objects in all three spatial dimensions.
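The counting and relative-positioning sub-tasks described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` record, the box/depth fields, and the relation labels are all hypothetical stand-ins for the outputs of an object detector (e.g. EfficientDet-D7 bounding boxes) paired with a monocular depth estimate (e.g. MiDaS), which together give positions along all three spatial axes.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one detected object: a class label,
    a normalized bounding box, and a relative monocular depth value."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    depth: float  # relative depth; smaller means closer to the camera


def center(d: Detection) -> tuple:
    """Center of the bounding box in image coordinates."""
    return ((d.box[0] + d.box[2]) / 2, (d.box[1] + d.box[3]) / 2)


def count_objects(detections: list, label: str) -> int:
    """Counting sub-task: number of detected instances of one class."""
    return sum(1 for d in detections if d.label == label)


def relative_position(a: Detection, b: Detection) -> tuple:
    """Positioning sub-task: relation of `a` with respect to `b` along
    the horizontal, vertical, and depth axes."""
    (ax, ay), (bx, by) = center(a), center(b)
    horizontal = "left of" if ax < bx else "right of"
    vertical = "above" if ay < by else "below"  # image y grows downward
    depth = "in front of" if a.depth < b.depth else "behind"
    return horizontal, vertical, depth


# Usage with two made-up detections:
scene = [
    Detection("cup", (0.1, 0.2, 0.3, 0.4), 0.4),
    Detection("book", (0.6, 0.5, 0.9, 0.9), 0.7),
]
print(count_objects(scene, "cup"))             # 1
print(relative_position(scene[0], scene[1]))   # ('left of', 'above', 'in front of')
```

In the paper's pipeline, the question-understanding stage (Bangla-BERT trained on the procedurally generated dataset) would decide which of these functions to invoke and verbalize the result as a Bangla answer.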
Index Terms
- Visual question answering in Bangla to help individuals with visual impairments extract information about objects and their spatial relationships in images