Abstract
Image captioning is the task of assigning phrases to images that describe their visual content. Two main approaches are commonly used. On the one hand, traditional approaches transfer captions from the images most similar to the query image. On the other hand, recent methods generate captions with sentence-generation systems that learn a joint distribution of captions and images from a training set. The main limitation is that both approaches require a large number of manually captioned images. This paper presents an unsupervised approach to image captioning based on a two-step image–text retrieval process. First, given a query image, visually related words are retrieved from a multimodal index built from a large dataset of web pages containing images. A vocabulary of words is extracted from the web pages and, for each word, a feature model is learned from the visual representation of its associated images; in this way, query images can be matched to words by simply measuring visual similarity. Second, a textual query is formed with the retrieved words, and candidate captions are retrieved from a reference dataset of sentences. Despite its simplicity, our method avoids the need for manually labeled images and instead takes advantage of noisy data derived from the Web, e.g. web pages. The proposed approach has been evaluated on the Generation of Textual Descriptions of Images task at ImageCLEF 2015. Experimental results show its competitiveness. In addition, we report preliminary results on the use of our method for the auto-illustration task.
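The two-step process described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the per-word visual prototypes, the query feature vector, and the reference sentences are all hypothetical, and the similarity and overlap measures are simple stand-ins (cosine similarity for step 1, word-overlap counting for step 2).

```python
import math
from collections import Counter

# Hypothetical toy data: per-word visual feature models learned from web
# images, a query image feature vector, and a reference set of sentences.
word_models = {
    "dog":   [0.9, 0.1, 0.0],
    "beach": [0.1, 0.8, 0.1],
    "car":   [0.0, 0.1, 0.9],
}
query_image = [0.8, 0.6, 0.1]
reference_sentences = [
    "a dog runs along the beach",
    "a red car parked on the street",
    "two dogs playing with a ball",
]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Step 1: retrieve the words whose visual models best match the query image.
ranked_words = sorted(word_models,
                      key=lambda w: cosine(query_image, word_models[w]),
                      reverse=True)
query_words = ranked_words[:2]  # top-k visually related words

# Step 2: use the retrieved words as a textual query and rank candidate
# captions by simple word overlap with that query.
def overlap(sentence, words):
    tokens = Counter(sentence.split())
    return sum(tokens[w] for w in words)

captions = sorted(reference_sentences,
                  key=lambda s: overlap(s, query_words),
                  reverse=True)
print(query_words)  # → ['dog', 'beach']
print(captions[0])  # → a dog runs along the beach
```

The key point of the design is that no captioned training images are needed: step 1 only requires visual models per vocabulary word (learnable from noisy web data), and step 2 only requires an unlabeled pool of reference sentences.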
Notes
1. Both representations need to be normalized; in this work we use L1 normalization.
2. In this case, five human-authored textual descriptions are used as the gold-standard reference.
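The L1 normalization mentioned in note 1 can be sketched in a few lines; this is a generic illustration of the operation (scaling a vector so its absolute values sum to one), not code from the paper.

```python
def l1_normalize(vec):
    """Scale a feature vector so its absolute values sum to one."""
    total = sum(abs(x) for x in vec)
    # Leave an all-zero vector unchanged to avoid division by zero.
    return [x / total for x in vec] if total else list(vec)

print(l1_normalize([2.0, 1.0, 1.0]))  # → [0.5, 0.25, 0.25]
```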
Acknowledgments
This work was supported by CONACYT under project grant CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas de minería de textos). The first author was supported by CONACyT under scholarship No. 214764.
© 2016 Springer International Publishing Switzerland
Cite this paper
Pellegrin, L. et al. (2016). A Two-Step Retrieval Method for Image Captioning. In: Fuhr, N., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol. 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44563-2
Online ISBN: 978-3-319-44564-9