Abstract
Image captioning is a task typically approached by deep encoder-decoder architectures, where the encoder component works on a flat representation of the image while the decoder considers a sequential representation of natural language sentences. As such, these encoder-decoder architectures implement a simple and very specific form of structured transduction, that is a generalization of a predictive problem where the input data and output predictions might have substantially different structures and topologies. In this paper, we explore a generalization of such an approach by addressing the problem as a general structured transduction problem. In particular, we provide a framework that allows considering input and output information with a tree-structured representation. This allows taking into account the hierarchical nature underlying both images and sentences. To this end, we introduce an approach to generate tree-structured representations from images along with an autoencoder working with this kind of data. We empirically assess our approach on both synthetic and realistic tasks.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Bacciu, D., Bruno, A.: Deep tree transductions - a short survey. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds.) INNSBDDL 2019. PINNS, vol. 1, pp. 236–245. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-16841-4_25
Bacciu, D., Micheli, A., Sperduti, A.: Compositional generative mapping for tree-structured data-part I: bottom-up probabilistic modeling of trees. IEEE Trans. Neural Netw. Learn. Syst. 23(12), 1987–2002 (2012). https://doi.org/10.1109/TNNLS.2012.2222044
Bacciu, D., Micheli, A., Sperduti, A.: An input-output hidden Markov model for tree transductions. Neurocomputing 112, 34–46 (2013)
Cho, K., et al: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Dong, L., Lapata, M.: Language to logical form with neural attention. arXiv preprint arXiv:1601.01280 (2016)
Frasconi, P., Gori, M., Sperduti, A.: A general framework for adaptive processing of data structures. IEEE Trans. Neural Netw. 9(5), 768–786 (1998). https://doi.org/10.1109/72.712151
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arxiv 2015. arXiv preprint arXiv:1512.03385 (2015)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423–430 (2003)
Kuznetsova, P., Ordonez, V., Berg, T.L., Choi, Y.: TreeTalk: composition and compression of trees for image descriptions. Trans. Assoc. Comput. Linguist. 2, 351–362 (2014)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, T., Seyedhosseini, M., Tasdizen, T.: Image segmentation using hierarchical merge tree. IEEE Trans. Image Process. 25(10), 4596–4607 (2016)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. arxiv 2015. arXiv preprint arXiv:1512.00567 1512 (2015)
Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. corr abs/1411.4555 (2014). arXiv preprint arXiv:1411.4555 (2014)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2016)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057. PMLR (2015)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)
Acknowledgments
This work has been supported by the Italian Ministry of Education, University, and Research (MIUR) under project SIR 2014 LIST-IT (grant n. RBSI14STDE).
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bacciu, D., Serramazza, D. (2022). Learning Image Captioning as a Structured Transduction Task. In: Iliadis, L., Jayne, C., Tefas, A., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2022. Communications in Computer and Information Science, vol 1600. Springer, Cham. https://doi.org/10.1007/978-3-031-08223-8_20
Download citation
DOI: https://doi.org/10.1007/978-3-031-08223-8_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08222-1
Online ISBN: 978-3-031-08223-8
eBook Packages: Computer ScienceComputer Science (R0)