Cross-modal Deep Learning Applications: Audio-Visual Retrieval

  • Conference paper
Pattern Recognition. ICPR International Workshops and Challenges (ICPR 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12666)

Included in the following conference series: International Conference on Pattern Recognition (ICPR)

Abstract

Recently, deep neural networks have proven to be a powerful architecture for capturing the nonlinear distributions of high-dimensional multimedia data such as images, video, text, and audio, and they extend naturally to multi-modal data. How can multimedia data be exploited to the fullest? This question motivates an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented as a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. The contribution of the proposed method is threefold: i) a feature selection model chooses the top-k audio and visual feature representations; ii) a novel combination of training losses enforces inter-modal similarity and intra-modal invariance; iii) because paired video-music data is scarce, we construct a dataset of video-music pairs from the YouTube 8M and MER31K datasets. Experiments show that the proposed model outperforms competing methods.
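Only the abstract is reproduced here, so the sketch below illustrates the general technique it describes rather than the authors' implementation: a minimal PyTorch two-branch network that embeds audio and visual features in a shared subspace, trained with a bidirectional hinge loss for inter-modal similarity plus a penalty encouraging intra-modal invariance. Every layer size, the magnitude-based top-k selection rule, the concrete form of both loss terms, and the weight lam are assumptions; the 128-d audio and 1024-d visual inputs simply mirror the feature sizes distributed with YouTube 8M.

import torch
import torch.nn as nn
import torch.nn.functional as F

def topk_select(features, k):
    # Keep the k largest-magnitude dimensions of each feature vector.
    # (The abstract only names a feature selection model; this
    # magnitude-based rule is an assumption.)
    _, idx = features.abs().topk(k, dim=-1)
    return torch.gather(features, -1, idx)

class TwoBranchEmbedding(nn.Module):
    # One branch per modality, projecting into a shared subspace.
    def __init__(self, audio_dim=128, visual_dim=1024, embed_dim=256):
        super().__init__()
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, audio, visual):
        # L2-normalise so cosine similarity reduces to a dot product.
        a = F.normalize(self.audio_branch(audio), dim=-1)
        v = F.normalize(self.visual_branch(visual), dim=-1)
        return a, v

def combined_loss(a, v, margin=0.2, lam=0.5):
    # Inter-modal term: bidirectional hinge loss that pushes each
    # matched video-music pair above the hardest mismatched pair.
    sim = v @ a.t()                    # (B, B) pairwise similarities
    pos = sim.diag().unsqueeze(1)      # similarity of matched pairs
    off = 1.0 - torch.eye(sim.size(0), device=sim.device)  # mask diagonal
    cost_v2a = (F.relu(margin + sim - pos) * off).max(dim=1).values
    cost_a2v = (F.relu(margin + sim - pos.t()) * off).max(dim=0).values
    inter = (cost_v2a + cost_a2v).mean()
    # Intra-modal invariance term: keep the neighbourhood structure
    # within each modality consistent across the two branches.
    intra = F.mse_loss(a @ a.t(), v @ v.t())
    return inter + lam * intra

Illustrative usage, with randomly generated stand-in features:

model = TwoBranchEmbedding()
audio = topk_select(torch.randn(32, 160), k=128)  # e.g. clip-level audio features
visual = torch.randn(32, 1024)                    # e.g. pooled frame features
a, v = model(audio, visual)
loss = combined_loss(a, v)
loss.backward()

At retrieval time one would embed a query video with the visual branch and return the music clips whose audio embeddings score highest under v @ a.t().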

Acknowledgment

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61631016 and 61901421), the National Key R&D Program of China (Grant No. 2018YFB1403903) and the Fundamental Research Funds for the Central Universities (Grant Nos. CUC200B017, 2019E002 and CUC19ZD003).

Author information

Corresponding authors

Correspondence to Cong Jin, Yun Tie or Xin Lv.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Jin, C., et al. (2021). Cross-modal Deep Learning Applications: Audio-Visual Retrieval. In: Del Bimbo, A., et al. (eds.) Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol. 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_26

  • DOI: https://doi.org/10.1007/978-3-030-68780-9_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68779-3

  • Online ISBN: 978-3-030-68780-9

  • eBook Packages: Computer Science, Computer Science (R0)
