Abstract
Deep neural networks have recently proven to be a powerful architecture for capturing the nonlinear distribution of high-dimensional multimedia data such as images, video, text, and audio, and this capability extends naturally to multi-modal data. Making full use of such data leads to an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. The contributions of the proposed method are threefold: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel training loss is used that combines an inter-modal similarity term with an intra-modal invariance term; iii) since no suitable video-music paired dataset exists, we construct a dataset of video-music pairs from the YouTube-8M and MER31K datasets. Experiments show that the proposed model outperforms competing methods.
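The abstract's core idea, similarity in a shared embedding space, trained with an inter-modal ranking term plus an intra-modal invariance term, can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the margin, the invariance weight, and the particular formalization of intra-modal invariance below are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project each embedding onto the unit sphere so dot products
    # equal cosine similarity.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def joint_embedding_loss(video_emb, audio_emb, margin=0.2, invariance_weight=0.1):
    """Bi-directional triplet ranking loss over matched (video_i, audio_i)
    pairs, plus a simple intra-modal invariance term. All hyperparameters
    and the invariance formalization are illustrative assumptions."""
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    sim = v @ a.T                 # sim[i, j]: similarity of video i, audio j
    pos = np.diag(sim)            # matched pairs lie on the diagonal
    n = sim.shape[0]
    off = ~np.eye(n, dtype=bool)  # mask selecting mismatched pairs
    # Hinge on every mismatched pair, in both retrieval directions
    # (video->audio and audio->video).
    cost_v2a = np.maximum(0.0, margin + sim - pos[:, None])[off].sum()
    cost_a2v = np.maximum(0.0, margin + sim - pos[None, :])[off].sum()
    # Intra-modal invariance: one possible reading is that matched items
    # should induce similar neighborhood structure within each modality.
    intra = np.mean((v @ v.T - a @ a.T) ** 2)
    return (cost_v2a + cost_a2v) / (n * (n - 1)) + invariance_weight * intra
```

When the two branches produce identical embeddings for matched pairs, both the ranking hinges and the invariance term vanish; mismatched random embeddings yield a strictly larger loss, which is the gradient signal that pulls the two modalities into the shared subspace.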
Acknowledgment
This research was supported by the National Natural Science Foundation of China (Grant Nos. 61631016 and 61901421), the National Key R&D Program of China (Grant No. 2018YFB1403903), and the Fundamental Research Funds for the Central Universities (Grant Nos. CUC200B017, 2019E002, and CUC19ZD003).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Jin, C. et al. (2021). Cross-modal Deep Learning Applications: Audio-Visual Retrieval. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science(), vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9