
What we see in a photograph: content selection for image captioning

  • Original Article, The Visual Computer

Abstract

We propose and experimentally investigate the usefulness of several features for selecting image content (objects) suitable for image captioning. The approach explores three broad categories of features, namely geometric, conceptual, and visual. Experiments suggest that widely known geometric ‘rules’ in art, aesthetics, or photography (such as the golden ratio or the rule-of-thirds) and facts about the human visual system (such as its wider horizontal than vertical viewing angle) provide no useful information for the task. Human captioners seem to prefer large, elongated (but not golden-ratio) objects positioned near the image center, irrespective of orientation. Conceptually, the preferred objects are neither too specific nor too general, and animate things are almost always mentioned; furthermore, some evidence is found for selecting diverse objects so as to achieve maximal image coverage in captions. Visual object features such as saliency, depth, edges, entropy, and contrast are all found to provide useful information. Beyond evaluating features in isolation, we investigate how well they combine by performing feature and feature-category ablation studies, leading to an effective set of features that can prove useful for operational systems. Moreover, we propose alternative ways of feature engineering and evaluation that address drawbacks of the evaluation methodology proposed in past literature.
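
To make the geometric feature category concrete, the following is a minimal sketch, not the authors' implementation, of how quantities such as relative object size, elongation, deviation from the golden ratio, centrality, and distance to the rule-of-thirds intersections could be computed for a single object bounding box. All function names, feature names, and normalisations are illustrative assumptions.

    # Minimal sketch of per-object geometric features (illustrative only;
    # names and normalisations are assumptions, not the paper's definitions).

    GOLDEN_RATIO = 1.618

    def geometric_features(box, img_w, img_h):
        """box = (x_min, y_min, x_max, y_max) in pixels."""
        x0, y0, x1, y1 = box
        w, h = x1 - x0, y1 - y0
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0

        # Relative size: fraction of the image area covered by the box.
        rel_area = (w * h) / float(img_w * img_h)

        # Elongation: longer side over shorter side, and its deviation
        # from the golden ratio.
        aspect = max(w, h) / max(min(w, h), 1)
        golden_dev = abs(aspect - GOLDEN_RATIO)

        # Centrality: distance of the box centre from the image centre,
        # with each axis normalised to [-1, 1].
        dx = (cx - img_w / 2.0) / (img_w / 2.0)
        dy = (cy - img_h / 2.0) / (img_h / 2.0)
        center_dist = (dx ** 2 + dy ** 2) ** 0.5

        # Rule of thirds: distance of the box centre to the nearest of the
        # four intersections of the third lines, normalised by the diagonal.
        thirds = [(img_w * i / 3.0, img_h * j / 3.0)
                  for i in (1, 2) for j in (1, 2)]
        thirds_dist = min(((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5
                          for tx, ty in thirds)
        thirds_dist /= (img_w ** 2 + img_h ** 2) ** 0.5

        return {"rel_area": rel_area, "aspect": aspect,
                "golden_dev": golden_dev, "center_dist": center_dist,
                "thirds_dist": thirds_dist}

In an operational content-selection pipeline, values like these would be computed per candidate bounding box and combined with conceptual and visual features in the kind of feature and feature-category ablation studies described above.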


Notes

  1. https://en.wikipedia.org/wiki/Word2vec.

  2. Note that there may be more than one bounding box per concept, e.g., two women in an image. While in Sect. 3.1 i denotes a concept, in Sects. 3.2 and 3.3 i denotes a bounding box.

  3. https://digital-photography-school.com/rule-of-thirds/.

  4. http://www.photographymad.com/pages/view/rule-of-thirds.

  5. This strategy is not arbitrary and has a parallel in information retrieval evaluation. For example, judging a text document’s relevance to an information need is notoriously subjective, with only modest agreement across human evaluators, even when the need is expressed by more than just a query, i.e., with a title, description, and narrative (called a ‘topic’ in the jargon of NIST’s annual Text REtrieval Conference, TREC). In this respect, TREC evaluations have typically used majority voting: for example, if two out of three judges say a document is relevant, it is taken as relevant.
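
As a concrete illustration of this aggregation rule, here is a minimal sketch of strict majority voting over binary judgements; the function name and interface are illustrative assumptions, not TREC tooling or part of the paper's method.

    # Minimal sketch: an item is accepted if a strict majority of judges
    # labelled it positive (e.g., relevant, or worth mentioning in a caption).

    def majority_vote(judgements):
        """judgements: list of booleans, one per human judge."""
        positives = sum(1 for j in judgements if j)
        return positives * 2 > len(judgements)

    # Two of three judges say "relevant" -> the item counts as relevant.
    assert majority_vote([True, True, False])
    assert not majority_vote([True, False, False])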


Author information


Corresponding author

Correspondence to Christos Veinidis.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Barlas, G., Veinidis, C. & Arampatzis, A. What we see in a photograph: content selection for image captioning. Vis Comput 37, 1309–1326 (2021). https://doi.org/10.1007/s00371-020-01867-9

