
Dense Captioning of Natural Scenes in Spanish

  • Conference paper
Pattern Recognition (MCPR 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10880)


Abstract

The inclusion of visually impaired people into daily life is a challenging and active area of research. This work studies how to deliver information about the surroundings to people as verbal descriptions in Spanish through wearable devices. We use a neural network (DenseCap) both to identify objects and to generate phrases about them. DenseCap runs on a server and describes images sent from a smartphone application; its textual output is then verbalized by the smartphone. Our implementation achieves a mean Average Precision (mAP) of 5.0, a score that jointly reflects object localization and caption quality, and takes an average of 7.5 s from the moment a picture is taken until the verbalization in Spanish is delivered.
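To make the capture-describe-verbalize loop concrete, the following is a minimal Python sketch of the client side of such a pipeline: an image is posted to a captioning server and the returned Spanish captions are spoken aloud. The endpoint URL, the JSON response schema, and the use of the `requests` and `pyttsx3` libraries are illustrative assumptions, not the authors' implementation (their client is an Android application talking to a DenseCap server).

```python
# Sketch of a smartphone-style client for a remote dense-captioning service.
# SERVER_URL and the {"captions": [...]} response format are hypothetical.
import requests
import pyttsx3

SERVER_URL = "http://example-server:5000/describe"  # hypothetical DenseCap endpoint


def describe_and_speak(image_path: str) -> None:
    # Send the captured image to the captioning server.
    with open(image_path, "rb") as f:
        response = requests.post(SERVER_URL, files={"image": f}, timeout=30)
    response.raise_for_status()

    # Assume the server answers with {"captions": ["una persona caminando", ...]}.
    captions = response.json().get("captions", [])

    # Verbalize each Spanish caption with an offline text-to-speech engine.
    engine = pyttsx3.init()
    for caption in captions:
        engine.say(caption)
    engine.runAndWait()


if __name__ == "__main__":
    describe_and_speak("scene.jpg")
```

In this sketch the round-trip latency is dominated by the upload and the server-side DenseCap inference, which is consistent with the 7.5 s end-to-end time reported in the abstract.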


References

  1. Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv:1511.03292v1 (2015)

  2. Atkinson, K.: GNU Aspell. http://aspell.net/. Accessed 08 Jan 2018

  3. Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Workshop on Statistical Machine Translation (2014)

  4. Eco, U.: Tratado de semiótica general. Debolsillo, Madrid (2008)

  5. Eslami, S., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., Hinton, G.: Attend, infer, repeat: fast scene understanding with generative models. arXiv:1603.08575 (2016)

  6. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)

  7. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2

  8. Greene, M., Botros, A., Beck, D., Fei-Fei, L.: What you see is what you expect: rapid scene understanding benefits from prior experience. Atten. Percept. Psychophys. 77(4), 1239–1251 (2015)

  9. Helcl, J., Libovický, J.: CUNI system for the WMT17 multimodal translation task. arXiv:1707.04550 (2017)

  10. Hitschler, J., Schamoni, S., Riezler, S.: Multimodal pivots for image caption translation. arXiv:1601.03916v3 (2016)

  11. Instituto Nacional de Estadística y Geografía: Estadísticas a propósito del día internacional de las personas con discapacidad. http://tinyurl.com/discapacidad. Accessed 15 Dec 2017

  12. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: IEEE CVPR, pp. 4565–4574 (2016)

  13. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: IEEE CVPR (2015)

  14. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE CVPR (2015)

  15. Kiros, J., Salakhutdinov, R., Zemel, R.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539v1 (2014)

  16. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D., Bernstein, M., Fei-Fei, L.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV (2016)

  17. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A., Berg, T.: Baby talk: understanding and generating simple image descriptions. In: IEEE CVPR (2011)

  18. Lan, W., Li, X., Dong, J.: Fluency-guided cross-lingual image captioning. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 1549–1557 (2017)

  19. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)

  20. Leo, M., Medioni, G., Trivedi, M., Kanade, T., Farinella, G.: Computer vision for assistive technologies. Comput. Vis. Image Underst. 154, 1–15 (2017)

  21. Li, L.J., Socher, R., Fei-Fei, L.: Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In: IEEE CVPR (2009)

  22. Li, S., Kulkarni, G., Berg, T., Berg, A., Choi, Y.: Composing simple image descriptions using web-scale n-grams. In: Conference on Computational Natural Language Learning (2011)

  23. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (M-RNN). In: ICLR (2015)

  24. Miyazaki, T., Shimizu, N.: Cross-lingual image caption generation. In: Annual Meeting of the Association for Computational Linguistics, pp. 1780–1790 (2016)

  25. Nisbet, R., Elder, J., Miner, G.: Handbook of Statistical Analysis and Data Mining Applications. Elsevier Inc., Amsterdam (2009)

  26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (2015)

  27. Tian, Y., Yang, X., Yi, C., Arditi, A.: Toward a computer vision-based wayfinding aid for blind persons to access unfamiliar indoor environments. Mach. Vis. Appl. 24(3), 521–535 (2013)

  28. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. arXiv:1411.4555v2 (2014)

  29. Wei, Q., Wang, X., Li, X.: Harvesting deep models for cross-lingual image annotation. In: Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (2017). http://doi.acm.org/10.1145/3095713.3095751

  30. World Health Organization: Global data on visual impairments 2010. https://tinyurl.com/globaldata2010. Accessed 29 Jan 2018

  31. World Health Organization: Visual impairment and blindness. http://tinyurl.com/impaired. Accessed 08 Dec 2017

  32. Yao, B., Yang, X., Lin, L., Lee, M., Zhu, S.: I2T: image parsing to text description. Proc. IEEE 98, 1485–1508 (2010)

  33. Yoshikawa, Y., Shigeto, Y., Takeuchi, A.: STAIR captions: constructing a large-scale Japanese image caption dataset. arXiv:1705.00823v1 (2017)

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Rodrigo Carrillo, Miguel Torres, and Luis Sáenz developed the Android application. This work was partially funded by grant SIP-IPN 20180779 to Joaquín Salas. Bogdan Raducanu is supported by Grant No. TIN2016-79717-R, funded by MINECO, Spain. Alejandro Gomez-Garay is supported by Grant No. 434110/618827, funded by CONACyT.

Author information

Corresponding author

Correspondence to Joaquín Salas.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Gomez-Garay, A., Raducanu, B., Salas, J. (2018). Dense Captioning of Natural Scenes in Spanish. In: Martínez-Trinidad, J., Carrasco-Ochoa, J., Olvera-López, J., Sarkar, S. (eds) Pattern Recognition. MCPR 2018. Lecture Notes in Computer Science, vol 10880. Springer, Cham. https://doi.org/10.1007/978-3-319-92198-3_15

  • DOI: https://doi.org/10.1007/978-3-319-92198-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92197-6

  • Online ISBN: 978-3-319-92198-3

  • eBook Packages: Computer Science, Computer Science (R0)
