A Mutual Information-Based Disentanglement Framework for Cross-Modal Retrieval

  • Conference paper
Neural Information Processing (ICONIP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13111)

Abstract

Cross-modal retrieval essentially extracts the semantics that an object shares across two different modalities. However, the “modality gap” can significantly limit performance when semantics are analyzed from each modality's samples in isolation. In this paper, to overcome the heterogeneity of multi-modal data, we propose a novel mutual information-based disentanglement framework that captures the precise shared semantics in cross-modal scenes. First, we design a disentanglement framework that extracts the shared parts of each modality, providing a basis for measuring semantics with mutual information. Second, we measure semantic associations from a distributional perspective, which overcomes the perturbations introduced by the modality gap. Finally, we formalize our framework and theoretically show that mutual information yields remarkable performance under the disentanglement framework. Extensive experimental results on two large benchmarks demonstrate that our approach achieves significant performance on the cross-modal retrieval task.
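
The abstract above describes two technical ingredients: disentangling the shared part of each modality's representation, and scoring cross-modal semantic association with mutual information estimated at the distribution level. As a minimal, purely illustrative sketch (not the authors' implementation), the PyTorch snippet below estimates a Donsker-Varadhan (MINE-style) lower bound on the mutual information between image and text embeddings; the Critic network, the mi_lower_bound helper, and the embedding sizes are assumptions introduced for this example.

```python
# Illustrative sketch only: a MINE-style (Donsker-Varadhan) lower bound on the
# mutual information between image and text embeddings. Class/function names
# and dimensions are hypothetical, not taken from the paper.
import math
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Scores (image, text) pairs; truly paired samples should receive higher scores."""

    def __init__(self, dim_img: int, dim_txt: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_img + dim_txt, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([img, txt], dim=-1)).squeeze(-1)


def mi_lower_bound(critic: Critic, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]."""
    joint = critic(img, txt)                       # scores on paired (joint) samples
    shuffled = txt[torch.randperm(txt.size(0))]    # shuffling breaks the pairing
    marginal = critic(img, shuffled)               # scores on product-of-marginals samples
    return joint.mean() - (torch.logsumexp(marginal, dim=0) - math.log(marginal.size(0)))


# Toy usage: maximizing the bound encourages the shared (disentangled) embeddings
# of the two modalities to carry high mutual information.
img_emb = torch.randn(64, 512)   # stand-in for image-encoder outputs
txt_emb = torch.randn(64, 512)   # stand-in for text-encoder outputs
critic = Critic(512, 512)
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = -mi_lower_bound(critic, img_emb, txt_emb)   # negate: the optimizer minimizes
loss.backward()
optimizer.step()
```

In a full retrieval pipeline, such a bound would typically be maximized jointly with the modality encoders so that only the shared, semantics-bearing factors are aligned; that joint training and the disentanglement losses themselves are beyond this sketch.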


Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (61972455), and the Joint Project of Bayescom. Xiaowang Zhang is supported by the program of Peiyang Young Scholars in Tianjin University (2019XRX-0032).

Author information

Corresponding author

Correspondence to Xiaowang Zhang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, H., et al. (2021). A Mutual Information-Based Disentanglement Framework for Cross-Modal Retrieval. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13111. Springer, Cham. https://doi.org/10.1007/978-3-030-92273-3_48

  • DOI: https://doi.org/10.1007/978-3-030-92273-3_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92272-6

  • Online ISBN: 978-3-030-92273-3

  • eBook Packages: Computer Science, Computer Science (R0)
