Learning Image Captioning as a Structured Transduction Task

  • Conference paper
Engineering Applications of Neural Networks (EANN 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1600)

Abstract

Image captioning is typically approached with deep encoder-decoder architectures, where the encoder works on a flat representation of the image while the decoder considers a sequential representation of natural language sentences. As such, these encoder-decoder architectures implement a simple and very specific form of structured transduction, that is, a generalization of a predictive problem in which the input data and output predictions might have substantially different structures and topologies. In this paper, we generalize this approach by addressing image captioning as a general structured transduction problem. In particular, we provide a framework that allows input and output information to be considered with a tree-structured representation, which takes into account the hierarchical nature underlying both images and sentences. To this end, we introduce an approach to generate tree-structured representations from images, along with an autoencoder working with this kind of data. We empirically assess our approach on both synthetic and realistic tasks.
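As a purely illustrative aid, the notion of a tree-structured input can be sketched as a bottom-up recursive encoding, in which each node's state is computed from its own label and the states of its children. The sketch below is a toy under stated assumptions and is not the model described in the paper: the `Node` class, the `encode` function, and the fixed `CHILD_WEIGHT` constant are hypothetical names introduced here, and the fixed weight merely stands in for parameters that a real model would learn.

```python
import math
from dataclasses import dataclass, field
from typing import List

# Hypothetical fixed weight standing in for learned parameters (assumption).
CHILD_WEIGHT = 0.5


@dataclass
class Node:
    """A tree node carrying a feature vector (e.g. for an image region)."""
    label: List[float]
    children: List["Node"] = field(default_factory=list)


def encode(node: Node) -> List[float]:
    """Encode a tree bottom-up into a vector with the same size as the labels.

    Each node's state is tanh(label + CHILD_WEIGHT * sum of child states),
    so the root state summarizes the whole hierarchy.
    """
    child_sum = [0.0] * len(node.label)
    for child in node.children:
        child_state = encode(child)
        child_sum = [a + b for a, b in zip(child_sum, child_state)]
    return [math.tanh(x + CHILD_WEIGHT * c) for x, c in zip(node.label, child_sum)]


# Toy "image tree": a root (whole image) with two region leaves.
leaf_a = Node([1.0, 0.0])
leaf_b = Node([0.0, 1.0])
root = Node([0.0, 0.0], [leaf_a, leaf_b])
code = encode(root)
```

Here `code` is a fixed-size summary of the whole tree. In a tree-to-tree setting, a decoder would expand such a summary top-down into an output tree (for example, a parse tree of the caption); that half is omitted from this sketch.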

Notes

  1. https://github.com/davide-serramazza/image_captionig_tree2tree_input-target_processing

  2. https://github.com/davide-serramazza/image_captionig_tree2tree

  3. https://github.com/davide-serramazza/Geometry-dataset

Acknowledgments

This work has been supported by the Italian Ministry of Education, University, and Research (MIUR) under project SIR 2014 LIST-IT (grant n. RBSI14STDE).

Author information

Correspondence to Davide Bacciu or Davide Serramazza.

Copyright information

© 2022 Springer Nature Switzerland AG

Cite this paper

Bacciu, D., Serramazza, D. (2022). Learning Image Captioning as a Structured Transduction Task. In: Iliadis, L., Jayne, C., Tefas, A., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2022. Communications in Computer and Information Science, vol 1600. Springer, Cham. https://doi.org/10.1007/978-3-031-08223-8_20

  • DOI: https://doi.org/10.1007/978-3-031-08223-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08222-1

  • Online ISBN: 978-3-031-08223-8

  • eBook Packages: Computer Science, Computer Science (R0)
