research-article

Deconfounded Cross-modal Matching for Content-based Micro-video Background Music Recommendation

Published: 15 April 2024

Abstract

Object-oriented micro-video background music recommendation is a complicated task in which the degree of matching between videos and background music is a major issue. However, music selections in user-generated content (UGC) are prone to selection bias caused by the historical preferences of uploaders. Because historical preferences are not fully reliable and may reflect obsolete behaviors, over-reliance on them should be avoided as knowledge and interests evolve dynamically. In this article, we propose a Deconfounded Cross-Modal matching model to mitigate such bias. Specifically, uploaders’ personal preferences for music genres are identified as confounders that spuriously correlate music embeddings with background music selections, causing the learned system to over-recommend music from majority groups. To address these confounders, backdoor adjustment is used to remove the spurious correlation between music embeddings and prediction scores. We further use a Monte Carlo estimator with a batch-level average as an approximation, which avoids integrating over the entire confounder space required by the adjustment. Furthermore, we design a teacher–student network that exploits the matching in music videos, which are professionally generated content (PGC) with specialized matching, to better recommend content-matching background music. A teacher network models the PGC data and guides the student network's matching of uploader-selected UGC data through Kullback–Leibler-based knowledge transfer. Extensive experiments on the TT-150k-genre dataset demonstrate the effectiveness of the proposed method. The code is publicly available at https://github.com/jing-1/DecCM
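
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a hypothetical scoring function f(v, m, z) in which the confounder z has the same dimension as the embeddings and simply shifts the music embedding. Under that assumption, backdoor adjustment computes the sum over z of f(v, m, z) P(z), and the Monte Carlo estimate replaces that sum with an average over the confounder samples found in the current batch. The knowledge transfer term is sketched as KL(teacher || student) over softmax-normalized scores for a set of candidate music tracks. All function names and toy values below are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def match_score(video: Vector, music: Vector, confounder: Vector) -> float:
    """Hypothetical f(v, m, z): the confounder shifts the music embedding
    before a dot product with the video embedding (an assumed form)."""
    return sigmoid(sum(v * (m + z) for v, m, z in zip(video, music, confounder)))


def deconfounded_score(video: Vector, music: Vector,
                       batch_confounders: Sequence[Vector]) -> float:
    """Monte Carlo estimate of sum_z f(v, m, z) P(z): average f over the
    confounder samples observed in the current batch."""
    scores = [match_score(video, music, z) for z in batch_confounders]
    return sum(scores) / len(scores)


def softmax(logits: Vector) -> List[float]:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_transfer_loss(teacher_logits: Vector, student_logits: Vector) -> float:
    """KL(teacher || student) between softmax distributions over candidates."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy usage with made-up three-dimensional embeddings.
video = [0.2, -0.1, 0.4]
music = [0.5, 0.3, -0.2]
batch_z = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [-0.1, 0.0, 0.3]]
score = deconfounded_score(video, music, batch_z)
loss = kl_transfer_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0])
```

Averaging over batch samples keeps the cost proportional to the batch size rather than to the size of the confounder space, which is the motivation the abstract gives for the approximation.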


    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3
    June 2024
    646 pages
    EISSN: 2157-6912
    DOI: 10.1145/3613609
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 April 2024
    Online AM: 06 March 2024
    Accepted: 14 February 2024
    Revised: 27 November 2023
    Received: 29 May 2023
    Published in TIST Volume 15, Issue 3

    Author Tags

    1. Cross-modal matching
    2. debiased recommender systems
    3. knowledge distillation
    4. variational auto-encoder

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
