Abstract
Image-text retrieval is a complex and challenging task in the cross-modal domain, and much prior work has made substantial progress on it. However, most existing approaches process images and text in a single pipeline or entangle the two modalities tightly, which is impractical and unfriendly to users in real-world settings. Moreover, the image regions extracted by Faster R-CNN are heavily over-sampled in the image pipeline, which introduces ambiguity into the extracted visual embeddings. Motivated by these observations, we introduce the Bottom-up Transformer Reasoning Network (BTRN). Our method is built upon transformer encoders that process the image and the text separately, and it embeds the tag information generated by Faster R-CNN to strengthen the connection between the two modalities. We evaluate our model with Recall@K and normalized discounted cumulative gain (NDCG). Extensive experiments show that our model achieves state-of-the-art results.
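The two evaluation metrics named in the abstract have standard definitions in retrieval: Recall@K counts a query as a hit if any ground-truth item appears in its top-K ranked results, and NDCG@K discounts graded relevance by the logarithm of the rank position, normalized by the ideal ordering. The sketch below illustrates those standard formulas in plain Python; the function names and the toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k results contain at least one relevant item.

    ranked_ids:   list of ranked result-id lists, one per query.
    relevant_ids: list of ground-truth id sets, one per query.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(r in ranked[:k] for r in relevant)
    )
    return hits / len(ranked_ids)

def ndcg_at_k(relevances, k):
    """NDCG@k for a single query.

    relevances: graded relevance scores of the returned list, in ranked order.
    DCG discounts position i (0-based) by log2(i + 2); IDCG is the DCG of
    the ideal (descending-relevance) ordering of the same scores.
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, a ranking that places the most relevant items first yields NDCG of 1.0, while any inversion lowers the score; Recall@K only asks whether a correct match appears anywhere in the top K.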
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yang, Z., Zhou, Y., Chen, A. (2023). Bottom-Up Transformer Reasoning Network for Text-Image Retrieval. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_15
DOI: https://doi.org/10.1007/978-981-99-1645-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0