
What we see in a photograph: content selection for image captioning

  • Original Article, The Visual Computer

Abstract

We propose and experimentally investigate the usefulness of several features for selecting image content (objects) suitable for image captioning. The approach explores three broad categories of features, namely geometric, conceptual, and visual. Experiments suggest that widely known geometric ‘rules’ in art, aesthetics, or photography (such as the golden ratio or the rule-of-thirds) and facts about the human visual system (such as its wider horizontal than vertical viewing angle) provide no useful information for the task. Human captioners seem to prefer large, elongated (but not golden-ratio) objects positioned near the image center, irrespective of orientation. Conceptually, the preferred objects are neither too specific nor too general, and animate things are almost always mentioned; furthermore, some evidence is found for selecting diverse objects so as to achieve maximal image coverage in captions. Visual object features such as saliency, depth, edges, entropy, and contrast are all found to provide useful information. Beyond evaluating features in isolation, we investigate how well they combine by performing feature and feature-category ablation studies, leading to an effective set of features that can prove useful for operational systems. Moreover, we propose alternative ways of feature engineering and evaluation that address drawbacks of the evaluation methodology proposed in past literature.
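
To make the geometric feature category concrete, the following is a minimal sketch, not the authors' implementation, of how quantities such as relative object size, elongation, deviation from the golden ratio, centrality, and distance to the rule-of-thirds intersections could be computed for a single object bounding box. All function names, feature names, and normalisations are illustrative assumptions.

    # Minimal sketch of per-object geometric features (illustrative only;
    # names and normalisations are assumptions, not the paper's definitions).

    GOLDEN_RATIO = 1.618

    def geometric_features(box, img_w, img_h):
        """box = (x_min, y_min, x_max, y_max) in pixels."""
        x0, y0, x1, y1 = box
        w, h = x1 - x0, y1 - y0
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0

        # Relative size: fraction of the image area covered by the box.
        rel_area = (w * h) / float(img_w * img_h)

        # Elongation: longer side over shorter side, and its deviation
        # from the golden ratio.
        aspect = max(w, h) / max(min(w, h), 1)
        golden_dev = abs(aspect - GOLDEN_RATIO)

        # Centrality: distance of the box centre from the image centre,
        # with each axis normalised to [-1, 1].
        dx = (cx - img_w / 2.0) / (img_w / 2.0)
        dy = (cy - img_h / 2.0) / (img_h / 2.0)
        center_dist = (dx ** 2 + dy ** 2) ** 0.5

        # Rule of thirds: distance of the box centre to the nearest of the
        # four intersections of the third lines, normalised by the diagonal.
        thirds = [(img_w * i / 3.0, img_h * j / 3.0)
                  for i in (1, 2) for j in (1, 2)]
        thirds_dist = min(((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5
                          for tx, ty in thirds)
        thirds_dist /= (img_w ** 2 + img_h ** 2) ** 0.5

        return {"rel_area": rel_area, "aspect": aspect,
                "golden_dev": golden_dev, "center_dist": center_dist,
                "thirds_dist": thirds_dist}

In an operational content-selection pipeline, values like these would be computed per candidate bounding box and combined with conceptual and visual features in the kind of feature and feature-category ablation studies described above.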


Notes

  1. https://en.wikipedia.org/wiki/Word2vec.

  2. Note that there may be more than one bounding box per concept, e.g., two women in an image. While in Sect. 3.1 i denotes a concept, in Sects. 3.2 and 3.3 i denotes a bounding box.

  3. https://digital-photography-school.com/rule-of-thirds/.

  4. http://www.photographymad.com/pages/view/rule-of-thirds.

  5. This strategy is not arbitrary and has a parallel in information retrieval evaluation. For example, judging a text document’s relevance to an information need is notoriously subjective, with only modest agreement across human evaluators, even when the need is expressed by more than just a query, i.e., with a title, description, and narrative (called a ‘topic’ in the jargon of NIST’s annual Text REtrieval Conference, TREC). In this respect, TREC evaluations have typically used majority voting: for example, if two out of three judges say a document is relevant, it is taken as relevant.
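
As a concrete illustration of this aggregation rule, here is a minimal sketch of strict majority voting over binary judgements; the function name and interface are illustrative assumptions, not TREC tooling or part of the paper's method.

    # Minimal sketch: an item is accepted if a strict majority of judges
    # labelled it positive (e.g., relevant, or worth mentioning in a caption).

    def majority_vote(judgements):
        """judgements: list of booleans, one per human judge."""
        positives = sum(1 for j in judgements if j)
        return positives * 2 > len(judgements)

    # Two of three judges say "relevant" -> the item counts as relevant.
    assert majority_vote([True, True, False])
    assert not majority_vote([True, False, False])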


Author information


Corresponding author

Correspondence to Christos Veinidis.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Barlas, G., Veinidis, C. & Arampatzis, A. What we see in a photograph: content selection for image captioning. Vis Comput 37, 1309–1326 (2021). https://doi.org/10.1007/s00371-020-01867-9

