Abstract
Image-text retrieval is a complex and challenging task in the cross-modal domain, and much prior work has made substantial progress on it. However, most existing approaches process images and text in a single pipeline or entangle the two modalities tightly, which is impractical and unfriendly to users in real-world settings. Moreover, the image regions extracted by Faster R-CNN are heavily over-sampled in the image pipeline, which introduces ambiguity into the extracted visual embeddings. Motivated by these observations, we introduce the Bottom-up Transformer Reasoning Network (BTRN). Our method is built upon transformer encoders that process the image and the text separately, and it embeds the tag information generated by Faster R-CNN to strengthen the connection between the two modalities. We evaluate our model with Recall@K and normalized discounted cumulative gain (NDCG). Extensive experiments show that our model achieves state-of-the-art results.
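The two evaluation metrics named in the abstract have standard definitions in retrieval: Recall@K counts a query as a hit if any ground-truth item appears in its top-K ranked results, and NDCG@K discounts graded relevance by the logarithm of the rank position, normalized by the ideal ordering. The sketch below illustrates those standard formulas in plain Python; the function names and the toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k results contain at least one relevant item.

    ranked_ids:   list of ranked result-id lists, one per query.
    relevant_ids: list of ground-truth id sets, one per query.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(r in ranked[:k] for r in relevant)
    )
    return hits / len(ranked_ids)

def ndcg_at_k(relevances, k):
    """NDCG@k for a single query.

    relevances: graded relevance scores of the returned list, in ranked order.
    DCG discounts position i (0-based) by log2(i + 2); IDCG is the DCG of
    the ideal (descending-relevance) ordering of the same scores.
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, a ranking that places the most relevant items first yields NDCG of 1.0, while any inversion lowers the score; Recall@K only asks whether a correct match appears anywhere in the top K.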
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yang, Z., Zhou, Y., Chen, A. (2023). Bottom-Up Transformer Reasoning Network for Text-Image Retrieval. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_15
DOI: https://doi.org/10.1007/978-981-99-1645-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0