research-article

Deconfounded Cross-modal Matching for Content-based Micro-video Background Music Recommendation

Authors:

Zhenzhong Chen

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3

Article No.: 50, Pages 1 - 25

https://doi.org/10.1145/3650042

Published: 15 April 2024

Abstract

Object-oriented micro-video background music recommendation is a complicated task in which the matching degree between videos and background music is a major issue. However, music selections in user-generated content (UGC) are prone to selection bias caused by the historical preferences of uploaders. Since historical preferences are not fully reliable and may reflect obsolete behaviors, over-reliance on them should be avoided as knowledge and interests dynamically evolve. In this article, we propose a Deconfounded Cross-Modal matching model to mitigate such bias. Specifically, uploaders’ personal preferences for music genres are identified as confounders that spuriously correlate music embeddings and background music selections, causing the learned system to over-recommend music from majority groups. To resolve such confounders, backdoor adjustment is used to deconfound the spurious correlation between music embeddings and prediction scores. We further use a Monte Carlo estimator with a batch-level average as an approximation, which avoids integrating over the entire confounder space required by the adjustment. Furthermore, we design a teacher–student network that exploits the matching in music videos, which are professionally generated content (PGC) with specialized matching, to better recommend content-matching background music. The PGC data are modeled by a teacher network, which guides the student network's matching of uploader-selected UGC data through Kullback–Leibler-based knowledge transfer. Extensive experiments on the TT-150k-genre dataset demonstrate the effectiveness of the proposed method. The code is publicly available at https://github.com/jing-1/DecCM
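The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository); the scoring function `score`, the weight vector `w`, the use of a discrete confounder, and the KL direction (teacher to student) are all assumptions made for illustration. The sketch compares the exact backdoor adjustment, sum over z of P(y | video, music, z) P(z), with a Monte Carlo estimate that averages the score over confounder values observed in a batch, and computes a KL divergence between teacher and student matching distributions.

```python
import numpy as np


def score(video, music, z, w):
    """Hypothetical matching probability P(y=1 | video, music, z).

    video, music: 1-D embeddings of equal length.
    z: 1-D confounder vector (e.g. an uploader's genre preference).
    w: 1-D weights on the confounder (illustrative, not from the paper).
    """
    logit = video @ music + z @ w
    return 1.0 / (1.0 + np.exp(-logit))


def backdoor_exact(video, music, z_values, z_probs, w):
    """Exact backdoor adjustment: sum_z P(y | video, music, z) * P(z)."""
    total = 0.0
    for z, p in zip(z_values, z_probs):
        total += p * score(video, music, z, w)
    return float(total)


def backdoor_monte_carlo(video, music, z_batch, w):
    """Monte Carlo estimate of the adjustment.

    z_batch holds confounder vectors observed in one mini-batch, treated
    as samples from P(z); the batch-level average replaces the full sum.
    """
    return float(np.mean([score(video, music, z, w) for z in z_batch]))


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over candidate-music distributions.

    The direction is an assumption; the abstract only states that the
    knowledge transfer is Kullback-Leibler based.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

In this sketch, when the batch happens to contain each confounder value in proportion to P(z), the Monte Carlo estimate coincides with the exact sum; otherwise it is a sample-based approximation of it. The KL term would be added to the student's matching loss so that its score distribution over candidate music moves toward the teacher's.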

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Recommender systems

Published In

ACM Transactions on Intelligent Systems and Technology Volume 15, Issue 3

June 2024

646 pages

EISSN: 2157-6912

DOI: 10.1145/3613609

Editor:
Huan Liu
Arizona State University, USA

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 April 2024

Online AM: 06 March 2024

Accepted: 14 February 2024

Revised: 27 November 2023

Received: 29 May 2023

Published in TIST Volume 15, Issue 3

Funding Sources

National Natural Science Foundation of China
