Abstract
Bidirectional visual-text retrieval has attracted broad interest in the computer vision community. In this paper, we present an end-to-end trainable model built around a proposed dual-path attention with distribution analysis network (DADAN), designed to minimize the misalignment caused by irrelevant matches. The architecture splits the attention path according to a distribution analysis, so that targeted attention mechanisms can be designed to capture the text-region pairs that truly contribute to matching. Specifically, the proposed row-wise attention and column-wise attention perform relative similarity analysis in the query modality and the retrieval modality simultaneously. In each retrieval direction, the significance of relevance is assessed comprehensively alongside latent alignment inference. As a result, the method not only filters out the irrelevant matches that current studies mainly target, but also yields a more reasonable ordering of retrieval results. Experimental results on public benchmarks show noticeable improvement on text-image matching, especially in the text retrieval direction.
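To make the row-wise/column-wise attention idea concrete, the following is a minimal NumPy sketch of attention applied along both axes of a region-word similarity matrix. It is an illustration only: the function names, the `temperature` parameter, and the cosine-based scoring are assumptions for exposition, and the distribution analysis that DADAN uses to split the two paths is not modeled here.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dual_path_attention_score(regions, words, temperature=9.0):
    """Toy row-/column-wise attention over a region-word similarity
    matrix. `regions` is (R, D) image-region features, `words` is
    (W, D) word features; returns a scalar matching score.
    This is a hypothetical simplification, not the paper's exact model."""
    r = l2_normalize(regions)
    w = l2_normalize(words)
    S = r @ w.T                                  # (R, W) cosine similarities

    # Row-wise path: each image region attends over all sentence words.
    row_attn = softmax(temperature * S, axis=1)  # (R, W)
    text_ctx = row_attn @ words                  # (R, D) text context per region

    # Column-wise path: each word attends over all image regions.
    col_attn = softmax(temperature * S, axis=0)  # (R, W)
    img_ctx = col_attn.T @ regions               # (W, D) visual context per word

    # Score each path by the mean cosine similarity to its attended
    # context, then average the two retrieval directions.
    img2txt = np.mean(np.sum(r * l2_normalize(text_ctx), axis=1))
    txt2img = np.mean(np.sum(w * l2_normalize(img_ctx), axis=1))
    return 0.5 * (img2txt + txt2img)

# Example: 36 detected regions and a 12-word sentence, 256-d features.
score = dual_path_attention_score(np.random.randn(36, 256),
                                  np.random.randn(12, 256))
```

Attending along both axes of the same similarity matrix is what lets relevance be judged in both the query and retrieval modalities at once: the row path asks "which words matter for this region," while the column path asks "which regions matter for this word."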








Additional information
This work was supported by the National Natural Science Foundation of China under Grant 61872143.
Cite this article
Li, W., Zhu, H., Yang, S. et al. DADAN: dual-path attention with distribution analysis network for text-image matching. SIViP 16, 797–805 (2022). https://doi.org/10.1007/s11760-021-02020-2