Abstract
While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied for nearly a decade on (human-based) image captioning services to learn about the images they take, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind, each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly share the dataset, along with captioning challenge instructions, at https://vizwiz.org.
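As a concrete illustration of the dataset's basic structure (each image paired with five crowdsourced captions), the sketch below shows how such annotations could be loaded and grouped in Python, assuming a COCO-style JSON annotation file; the file name and field names here are illustrative assumptions, not the documented VizWiz-Captions format.

```python
import json
from collections import defaultdict

# Assumed COCO-style annotation file; the actual file name and field
# names in the VizWiz-Captions release may differ.
with open("annotations/train.json") as f:
    data = json.load(f)

# Map each image id to its file name.
images = {img["id"]: img["file_name"] for img in data["images"]}

# Group the crowdsourced captions by image (five captions per image).
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Show one example image with its captions.
example_id = next(iter(captions_by_image))
print(images[example_id])
for caption in captions_by_image[example_id]:
    print(" -", caption)
```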
Notes
1. Throughout, we use “caption” and “description” interchangeably.
2. For “yes/no” visual questions, we sampled 50 that have the answer “yes” and another 50 with the answer “no.” For “number” visual questions, we sampled 50 that begin with the question “How many” and another 50 that begin with “How much.” Finally, we randomly sampled another 100 visual questions from the “other” category (a sampling sketch follows these notes).
3. We show parallel analysis in the Supplementary Materials using the proportions of each dataset rather than absolute numbers. For both sets of results, we only show a subset of the 70 scene categories.
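The stratified sampling described in note 2 (50 “yes” answers, 50 “no” answers, 50 “How many” questions, 50 “How much” questions, and 100 “other” questions) could be reproduced roughly as in the sketch below; the `question`, `answer`, and `answer_type` fields are hypothetical names used only for illustration.

```python
import random

def sample_visual_questions(vqs, seed=0):
    """Stratified sample of 300 visual questions, mirroring note 2.

    `vqs` is assumed to be a list of dicts with hypothetical keys
    'question', 'answer', and 'answer_type' (one of 'yes/no',
    'number', or 'other'); these names are illustrative.
    """
    rng = random.Random(seed)
    yes = [q for q in vqs if q["answer_type"] == "yes/no" and q["answer"] == "yes"]
    no = [q for q in vqs if q["answer_type"] == "yes/no" and q["answer"] == "no"]
    how_many = [q for q in vqs if q["answer_type"] == "number"
                and q["question"].lower().startswith("how many")]
    how_much = [q for q in vqs if q["answer_type"] == "number"
                and q["question"].lower().startswith("how much")]
    other = [q for q in vqs if q["answer_type"] == "other"]

    return (rng.sample(yes, 50) + rng.sample(no, 50)
            + rng.sample(how_many, 50) + rng.sample(how_much, 50)
            + rng.sample(other, 100))
```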
Acknowledgements
We thank Meredith Ringel Morris, Ed Cutrell, Neel Joshi, Besmira Nushi, and Kenneth R. Fleischmann for their valuable discussions about this work. We thank Peter Anderson and Harsh Agrawal for sharing their code for setting up the EvalAI evaluation server. We thank the anonymous crowdworkers for providing the annotations. This work is supported by National Science Foundation funding (IIS-1755593), gifts from Microsoft, and gifts from Amazon.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Gurari, D., Zhao, Y., Zhang, M., Bhattacharya, N. (2020). Captioning Images Taken by People Who Are Blind. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12362. Springer, Cham. https://doi.org/10.1007/978-3-030-58520-4_25
DOI: https://doi.org/10.1007/978-3-030-58520-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58519-8
Online ISBN: 978-3-030-58520-4
eBook Packages: Computer Science, Computer Science (R0)