Abstract
Image captioning is the task of assigning phrases to images that describe their visual content. Two main approaches are commonly used. On the one hand, traditional approaches transfer captions from the images most similar to the query image. On the other hand, recent methods generate captions with sentence-generation systems that learn a joint distribution of captions and images from a training set. The main limitation is that both approaches require a large number of manually captioned images. This paper presents an unsupervised approach to image captioning based on a two-step image–text retrieval process. First, given a query image, visually related words are retrieved from a multimodal index built from a large dataset of web pages containing images. A vocabulary of words is extracted from the web pages and, for each word, a feature model is learned from the visual representation of its associated images; in this way, query images can be matched to words by simply measuring visual similarity. Second, a textual query is formed with the retrieved words, and candidate captions are retrieved from a reference dataset of sentences. Despite its simplicity, our method avoids the need for manually labeled images and instead takes advantage of noisy data derived from the Web, e.g. web pages. The proposed approach has been evaluated on the Generation of Textual Descriptions of Images task at ImageCLEF 2015. Experimental results show its competitiveness. In addition, we report preliminary results on the use of our method for the auto-illustration task.
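The two-step process described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the per-word visual prototypes, the query feature vector, and the reference sentences are all hypothetical, and the similarity and overlap measures are simple stand-ins (cosine similarity for step 1, word-overlap counting for step 2).

```python
import math
from collections import Counter

# Hypothetical toy data: per-word visual feature models learned from web
# images, a query image feature vector, and a reference set of sentences.
word_models = {
    "dog":   [0.9, 0.1, 0.0],
    "beach": [0.1, 0.8, 0.1],
    "car":   [0.0, 0.1, 0.9],
}
query_image = [0.8, 0.6, 0.1]
reference_sentences = [
    "a dog runs along the beach",
    "a red car parked on the street",
    "two dogs playing with a ball",
]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Step 1: retrieve the words whose visual models best match the query image.
ranked_words = sorted(word_models,
                      key=lambda w: cosine(query_image, word_models[w]),
                      reverse=True)
query_words = ranked_words[:2]  # top-k visually related words

# Step 2: use the retrieved words as a textual query and rank candidate
# captions by simple word overlap with that query.
def overlap(sentence, words):
    tokens = Counter(sentence.split())
    return sum(tokens[w] for w in words)

captions = sorted(reference_sentences,
                  key=lambda s: overlap(s, query_words),
                  reverse=True)
print(query_words)  # → ['dog', 'beach']
print(captions[0])  # → a dog runs along the beach
```

The key point of the design is that no captioned training images are needed: step 1 only requires visual models per vocabulary word (learnable from noisy web data), and step 2 only requires an unlabeled pool of reference sentences.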
Notes
1. Both representations need to be normalized; in this work we use L1 normalization.
2. In this case, five human-authored textual descriptions are used as the gold-standard reference.
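The L1 normalization mentioned in note 1 can be sketched in a few lines; this is a generic illustration of the operation (scaling a vector so its absolute values sum to one), not code from the paper.

```python
def l1_normalize(vec):
    """Scale a feature vector so its absolute values sum to one."""
    total = sum(abs(x) for x in vec)
    # Leave an all-zero vector unchanged to avoid division by zero.
    return [x / total for x in vec] if total else list(vec)

print(l1_normalize([2.0, 1.0, 1.0]))  # → [0.5, 0.25, 0.25]
```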
Acknowledgments
This work was supported by CONACYT under project grant CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas de minería de textos). The first author was supported by CONACyT under scholarship No. 214764.
© 2016 Springer International Publishing Switzerland
Cite this paper
Pellegrin, L. et al. (2016). A Two-Step Retrieval Method for Image Captioning. In: Fuhr, N., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol. 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44563-2
Online ISBN: 978-3-319-44564-9