
DBFC-Net: a uniform framework for fine-grained cross-media retrieval

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

With the growth of various types of media data (e.g., text, image, video, and audio), fine-grained cross-media retrieval, which aims to provide flexible and accurate query services, has attracted significant attention. Unlike traditional keyword-based retrieval, the queries and results in fine-grained cross-media retrieval may be of different media types. In this work, we demonstrate that the quality of retrieval results can be further improved by additionally considering media-specific information. Moreover, the feature-extraction process should differ for queries of different media types. To this end, we propose a novel network architecture, the Double Branch Fine-grained Cross-media Net (DBFC-Net), which is the first work to use media-specific information to construct common features within a uniform framework. Furthermore, we devise an effective distance metric (cosine+) for fine-grained cross-media retrieval. Compared with commonly used metrics (e.g., the cosine function), our proposed cosine+ metric is well adapted to fine-grained retrieval scenarios. Extensive experiments and ablation studies on publicly available datasets demonstrate the effectiveness of our approach.
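The abstract contrasts the proposed cosine+ metric with the commonly used cosine function, whose definition is not restated here. As background, the sketch below shows how the standard cosine baseline ranks a gallery of common-space features against a query; the feature values, the `retrieve` helper, and the media-type annotations are illustrative assumptions, not the paper's actual data or the cosine+ formulation.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_feat, gallery_feats):
    """Rank gallery items (of any media type) by similarity to the query.

    Returns (gallery_index, score) pairs, most similar first.
    """
    scores = [(i, cosine_similarity(query_feat, g))
              for i, g in enumerate(gallery_feats)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy common-space features: a query (e.g., an image) against a mixed gallery.
query = [0.9, 0.1, 0.0]
gallery = [
    [0.8, 0.2, 0.1],   # e.g., a video clip of a similar subcategory
    [0.1, 0.9, 0.3],   # e.g., an audio clip of a different subcategory
    [0.9, 0.1, 0.05],  # e.g., a text description of the same subcategory
]
ranking = retrieve(query, gallery)  # best match first
```

Because all media types are embedded into one common space, cross-media retrieval reduces to nearest-neighbor ranking under the chosen metric; the paper's contribution lies in how the common features are constructed (DBFC-Net) and how the metric is adapted (cosine+).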



Author information


Corresponding author

Correspondence to Yazhou Yao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Q., Guo, Y. & Yao, Y. DBFC-Net: a uniform framework for fine-grained cross-media retrieval. Multimedia Systems 28, 423–432 (2022). https://doi.org/10.1007/s00530-021-00825-2

