Abstract
Deep neural networks have recently proven to be a powerful architecture for capturing the nonlinear distribution of high-dimensional multimedia data such as images, video, text, and audio, and this capability extends naturally to multi-modal data. Making full use of such data leads to an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. The contributions of the proposed method are threefold: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel training loss is used that combines an inter-modal similarity term with an intra-modal invariance term; iii) since no suitable video-music paired dataset exists, we construct a dataset of video-music pairs from the YouTube-8M and MER31K datasets. Experiments show that the proposed model outperforms competing methods.
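The abstract's core idea, similarity in a shared embedding space, trained with an inter-modal ranking term plus an intra-modal invariance term, can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the margin, the invariance weight, and the particular formalization of intra-modal invariance below are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project each embedding onto the unit sphere so dot products
    # equal cosine similarity.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def joint_embedding_loss(video_emb, audio_emb, margin=0.2, invariance_weight=0.1):
    """Bi-directional triplet ranking loss over matched (video_i, audio_i)
    pairs, plus a simple intra-modal invariance term. All hyperparameters
    and the invariance formalization are illustrative assumptions."""
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    sim = v @ a.T                 # sim[i, j]: similarity of video i, audio j
    pos = np.diag(sim)            # matched pairs lie on the diagonal
    n = sim.shape[0]
    off = ~np.eye(n, dtype=bool)  # mask selecting mismatched pairs
    # Hinge on every mismatched pair, in both retrieval directions
    # (video->audio and audio->video).
    cost_v2a = np.maximum(0.0, margin + sim - pos[:, None])[off].sum()
    cost_a2v = np.maximum(0.0, margin + sim - pos[None, :])[off].sum()
    # Intra-modal invariance: one possible reading is that matched items
    # should induce similar neighborhood structure within each modality.
    intra = np.mean((v @ v.T - a @ a.T) ** 2)
    return (cost_v2a + cost_a2v) / (n * (n - 1)) + invariance_weight * intra
```

When the two branches produce identical embeddings for matched pairs, both the ranking hinges and the invariance term vanish; mismatched random embeddings yield a strictly larger loss, which is the gradient signal that pulls the two modalities into the shared subspace.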
Acknowledgment
This research was supported by the National Natural Science Foundation of China (Grant Nos. 61631016 and 61901421), the National Key R&D Program of China (Grant No. 2018YFB1403903), and the Fundamental Research Funds for the Central Universities (Grant Nos. CUC200B017, 2019E002, and CUC19ZD003).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Jin, C. et al. (2021). Cross-modal Deep Learning Applications: Audio-Visual Retrieval. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science(), vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9