
Dense Captioning of Natural Scenes in Spanish

  • Conference paper
Pattern Recognition (MCPR 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10880)


Abstract

The inclusion of visually impaired people into daily life is a challenging and active area of research. This work studies how to deliver information about the surroundings to people as verbal descriptions in Spanish through wearable devices. We use a neural network (DenseCap) both to identify objects and to generate phrases about them. DenseCap runs on a server and describes images sent from a smartphone application; its textual output is then verbalized by the smartphone. Our implementation achieves a mean Average Precision (mAP) of 5.0, a score that jointly reflects object localization and caption quality, and takes an average of 7.5 s from the moment a picture is taken until the verbalization in Spanish is delivered.
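To make the capture-describe-verbalize loop concrete, the following is a minimal Python sketch of the client side of such a pipeline: an image is posted to a captioning server and the returned Spanish captions are spoken aloud. The endpoint URL, the JSON response schema, and the use of the `requests` and `pyttsx3` libraries are illustrative assumptions, not the authors' implementation (their client is an Android application talking to a DenseCap server).

```python
# Sketch of a smartphone-style client for a remote dense-captioning service.
# SERVER_URL and the {"captions": [...]} response format are hypothetical.
import requests
import pyttsx3

SERVER_URL = "http://example-server:5000/describe"  # hypothetical DenseCap endpoint


def describe_and_speak(image_path: str) -> None:
    # Send the captured image to the captioning server.
    with open(image_path, "rb") as f:
        response = requests.post(SERVER_URL, files={"image": f}, timeout=30)
    response.raise_for_status()

    # Assume the server answers with {"captions": ["una persona caminando", ...]}.
    captions = response.json().get("captions", [])

    # Verbalize each Spanish caption with an offline text-to-speech engine.
    engine = pyttsx3.init()
    for caption in captions:
        engine.say(caption)
    engine.runAndWait()


if __name__ == "__main__":
    describe_and_speak("scene.jpg")
```

In this sketch the round-trip latency is dominated by the upload and the server-side DenseCap inference, which is consistent with the 7.5 s end-to-end time reported in the abstract.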


References

  1. Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv:1511.03292v1 (2015)

  2. Atkinson, K.: GNU Aspell. http://aspell.net/. Accessed 08 Jan 2018

  3. Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Workshop on Statistical Machine Translation (2014)

  4. Eco, U.: Tratado de semiótica general. Debolsillo, Madrid (2008)

  5. Eslami, S., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., Hinton, G.: Attend, infer, repeat: fast scene understanding with generative models. arXiv:1603.08575 (2016)

  6. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)

  7. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2

  8. Greene, M., Botros, A., Beck, D., Fei-Fei, L.: What you see is what you expect: rapid scene understanding benefits from prior experience. Atten. Percept. Psychophys. 77(4), 1239–1251 (2015)

  9. Helcl, J., Libovický, J.: CUNI system for the WMT17 multimodal translation task. arXiv:1707.04550 (2017)

  10. Hitschler, J., Schamoni, S., Riezler, S.: Multimodal pivots for image caption translation. arXiv:1601.03916v3 (2016)

  11. Instituto Nacional de Estadística y Geografía: Estadísticas a propósito del día internacional de las personas con discapacidad. http://tinyurl.com/discapacidad. Accessed 15 Dec 2017

  12. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: IEEE CVPR, pp. 4565–4574 (2016)

  13. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: IEEE CVPR (2015)

  14. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE CVPR (2015)

  15. Kiros, J., Salakhutdinov, R., Zemel, R.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539v1 (2014)

  16. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D., Bernstein, M., Fei-Fei, L.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV (2016)

  17. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A., Berg, T.: Baby talk: understanding and generating simple image descriptions. In: IEEE CVPR (2011)

  18. Lan, W., Li, X., Dong, J.: Fluency-guided cross-lingual image captioning. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 1549–1557 (2017)

  19. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)

  20. Leo, M., Medioni, G., Trivedi, M., Kanade, T., Farinella, G.: Computer vision for assistive technologies. Comput. Vis. Image Underst. 154, 1–15 (2017)

  21. Li, L.J., Socher, R., Fei-Fei, L.: Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In: IEEE CVPR (2009)

  22. Li, S., Kulkarni, G., Berg, T., Berg, A., Choi, Y.: Composing simple image descriptions using web-scale n-grams. In: Conference on Computational Natural Language Learning (2011)

  23. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (M-RNN). In: ICLR (2015)

  24. Miyazaki, T., Shimizu, N.: Cross-lingual image caption generation. In: Annual Meeting of the Association for Computational Linguistics, pp. 1780–1790 (2016)

  25. Nisbet, R., Elder, J., Miner, G.: Handbook of Statistical Analysis and Data Mining Applications. Elsevier Inc., Amsterdam (2009)

  26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (2015)

  27. Tian, Y., Yang, X., Yi, C., Arditi, A.: Toward a computer vision-based wayfinding aid for blind persons to access unfamiliar indoor environments. Mach. Vis. Appl. 24(3), 521–535 (2013)

  28. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. arXiv:1411.4555v2 (2014)

  29. Wei, Q., Wang, X., Li, X.: Harvesting deep models for cross-lingual image annotation. In: Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (2017). http://doi.acm.org/10.1145/3095713.3095751

  30. World Health Organization: Global data on visual impairments 2010. https://tinyurl.com/globaldata2010. Accessed 29 Jan 2018

  31. World Health Organization: Visual impairment and blindness. http://tinyurl.com/impaired. Accessed 08 Dec 2017

  32. Yao, B., Yang, X., Lin, L., Lee, M., Zhu, S.: I2T: image parsing to text description. Proc. IEEE 98, 1485–1508 (2010)

  33. Yoshikawa, Y., Shigeto, Y., Takeuchi, A.: STAIR captions: constructing a large-scale Japanese image caption dataset. arXiv:1705.00823v1 (2017)

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Rodrigo Carrillo, Miguel Torres, and Luis Sáenz developed the Android application. This work was partially funded by grant SIP-IPN 20180779 to Joaquín Salas. Bogdan Raducanu is supported by Grant No. TIN2016-79717-R, funded by MINECO, Spain. Alejandro Gomez-Garay is supported by Grant No. 434110/618827, funded by CONACyT.

Author information

Corresponding author

Correspondence to Joaquín Salas.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Gomez-Garay, A., Raducanu, B., Salas, J. (2018). Dense Captioning of Natural Scenes in Spanish. In: Martínez-Trinidad, J., Carrasco-Ochoa, J., Olvera-López, J., Sarkar, S. (eds) Pattern Recognition. MCPR 2018. Lecture Notes in Computer Science, vol 10880. Springer, Cham. https://doi.org/10.1007/978-3-319-92198-3_15

  • DOI: https://doi.org/10.1007/978-3-319-92198-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92197-6

  • Online ISBN: 978-3-319-92198-3

  • eBook Packages: Computer Science, Computer Science (R0)
