ABSTRACT
Bangla is one of the most widely spoken languages in the world, and a large number of people rely on it as their primary medium of communication. Among them, many individuals have visual impairments, including but not limited to central vision loss, peripheral vision loss, and blurry vision. These impairments pose several challenges, one of which is extracting information from images. While techniques such as image captioning have been proposed to address this issue, visual question answering (VQA) offers a more in-depth and robust way to understand an image. However, there has been a paucity of research into developing a Bangla VQA system to assist individuals with visual impairments. To reduce this gap, we introduce a VQA system in Bangla designed to assist visually impaired individuals. VQA is a multifaceted problem, and in this paper we focus on finding the spatial relationships between objects. We break this problem down into three sub-tasks: object detection, object counting, and relative positioning of the detected objects. The system takes in a question from the user, determines which sub-task to perform, and returns the answer. We leverage several pre-trained models, including Bangla-BERT, EfficientDet-D7, InceptionResNetV2, and MiDaS v2.1. The major contributions of this paper are a procedurally generated dataset for training models to identify which action to perform based on the user's prompt, and the use of image segmentation to identify the relative spatial position between objects in all three spatial dimensions.
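The counting and relative-positioning sub-tasks described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` record, the box/depth fields, and the relation labels are all hypothetical stand-ins for the outputs of an object detector (e.g. EfficientDet-D7 bounding boxes) paired with a monocular depth estimate (e.g. MiDaS), which together give positions along all three spatial axes.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one detected object: a class label,
    a normalized bounding box, and a relative monocular depth value."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    depth: float  # relative depth; smaller means closer to the camera


def center(d: Detection) -> tuple:
    """Center of the bounding box in image coordinates."""
    return ((d.box[0] + d.box[2]) / 2, (d.box[1] + d.box[3]) / 2)


def count_objects(detections: list, label: str) -> int:
    """Counting sub-task: number of detected instances of one class."""
    return sum(1 for d in detections if d.label == label)


def relative_position(a: Detection, b: Detection) -> tuple:
    """Positioning sub-task: relation of `a` with respect to `b` along
    the horizontal, vertical, and depth axes."""
    (ax, ay), (bx, by) = center(a), center(b)
    horizontal = "left of" if ax < bx else "right of"
    vertical = "above" if ay < by else "below"  # image y grows downward
    depth = "in front of" if a.depth < b.depth else "behind"
    return horizontal, vertical, depth


# Usage with two made-up detections:
scene = [
    Detection("cup", (0.1, 0.2, 0.3, 0.4), 0.4),
    Detection("book", (0.6, 0.5, 0.9, 0.9), 0.7),
]
print(count_objects(scene, "cup"))             # 1
print(relative_position(scene[0], scene[1]))   # ('left of', 'above', 'in front of')
```

In the paper's pipeline, the question-understanding stage (Bangla-BERT trained on the procedurally generated dataset) would decide which of these functions to invoke and verbalize the result as a Bangla answer.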
Index Terms
- Visual question answering in Bangla to help individuals with visual impairments extract information about objects and their spatial relationships in images