Counterfactual contextual bandit for recommendation under delayed feedback

Cai, Ruichu; Lu, Ruming; Chen, Wei; Hao, Zhifeng

doi:10.1007/s00521-024-09800-0

Original Article
Published: 09 May 2024

Neural Computing and Applications, volume 36, pages 14599–14613 (2024)

Ruichu Cai^1,2, Ruming Lu^1, Wei Chen^1 (ORCID: orcid.org/0000-0002-8213-0567) & Zhifeng Hao^1,3

Abstract

Recommender systems have substantial practical value because they help people choose from large volumes of information. Existing recommender systems often suffer from selection bias because they ignore samples whose feedback is delayed. To alleviate this problem, we model recommendation as a batch contextual bandit problem and propose a counterfactual reward estimation approach. First, we formalize the counterfactual question as “would the user be interested in the recommended item if the delay time were before the collection time point?” This counterfactual reward is estimated in a survival analysis framework by fully exploiting the causal generation process of user feedback on batch data. Second, based on the estimated counterfactual rewards, the batch contextual bandit policy is updated for online recommendation in the next episode. Third, new batch data are generated during online recommendation for further counterfactual reward estimation. These three steps are iterated until the optimal policy is learned. We also prove a sub-linear regret bound for the learned bandit policy. In experiments on synthetic and Criteo datasets, our method achieved a $4\%$ improvement in average reward over the baseline methods, demonstrating the efficacy of our approach.
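
The three-step loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear reward model, a LinUCB-style policy, and an exponential delay distribution with a known rate, and it replaces the paper's estimator with the standard posterior used in exponential delayed-feedback models. All names (`counterfactual_reward`, `BatchLinUCB`, `run`, `delay_rate`, `horizon`) are hypothetical.

```python
import numpy as np


def counterfactual_reward(observed, elapsed, p_hat, delay_rate):
    """Estimate the reward a sample would show if its delay had already ended.

    Assumption: delays are exponential with a known rate, so the survival
    function is S(e) = exp(-delay_rate * e). For a sample with no feedback
    after elapsed time e, the posterior probability of an eventual positive
    feedback is p * S(e) / (1 - p + p * S(e)).
    """
    observed = np.asarray(observed, dtype=float)
    survival = np.exp(-delay_rate * np.asarray(elapsed, dtype=float))
    p = np.clip(np.asarray(p_hat, dtype=float), 1e-6, 1.0 - 1e-6)
    pending = p * survival / (1.0 - p + p * survival)
    return np.where(observed == 1, 1.0, pending)


class BatchLinUCB:
    """LinUCB-style policy whose parameters change only between batches."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def theta(self):
        # Ridge-regression estimate for each arm.
        return np.stack([np.linalg.solve(A, b) for A, b in zip(self.A, self.b)])

    def predict(self, x):
        return self.theta() @ x

    def select(self, x):
        width = np.array([np.sqrt(x @ np.linalg.solve(A, x)) for A in self.A])
        return int(np.argmax(self.predict(x) + self.alpha * width))

    def update(self, contexts, arms, rewards):
        for x, a, r in zip(contexts, arms, rewards):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x


def run(n_episodes=20, batch_size=200, dim=5, n_arms=3,
        delay_rate=0.5, horizon=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic ground truth: click probabilities x . theta lie in [0, 1].
    true_theta = rng.uniform(0.0, 1.0, size=(n_arms, dim)) / dim
    policy = BatchLinUCB(n_arms, dim)
    avg_rewards = []
    for _ in range(n_episodes):
        # Online recommendation with the current policy generates a batch.
        X = rng.uniform(0.0, 1.0, size=(batch_size, dim))
        arms = np.array([policy.select(x) for x in X])
        p_true = np.einsum("ij,ij->i", X, true_theta[arms])
        converts = rng.random(batch_size) < p_true
        delays = rng.exponential(1.0 / delay_rate, size=batch_size)
        elapsed = rng.uniform(0.0, horizon, size=batch_size)
        # Feedback arriving after the collection time is not observed.
        observed = (converts & (delays <= elapsed)).astype(float)
        # Estimate counterfactual rewards, then update the policy.
        p_hat = np.array([policy.predict(x)[a] for x, a in zip(X, arms)])
        rewards = counterfactual_reward(observed, elapsed, p_hat, delay_rate)
        policy.update(X, arms, rewards)
        avg_rewards.append(float(converts.mean()))
    return policy, avg_rewards


if __name__ == "__main__":
    _, history = run()
    print("average true reward in the last episode:", history[-1])
```

In this sketch a sample without feedback is not treated as a plain negative: its reward is the estimated probability that feedback would still arrive, which shrinks toward zero as the elapsed time grows. In practice the delay distribution would have to be estimated from the batch data rather than assumed known.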

Data availability

The data sets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This research was supported in part by the National Science and Technology Major Project (2021ZD0111500), the National Science Fund for Excellent Young Scholars (62122022), the Natural Science Foundation of China (62206064), and the major key project of PCL (PCL2021A12).

Author information

Authors and Affiliations

School of Computer Science, Guangdong University of Technology, Guangzhou, 510006, China
Ruichu Cai, Ruming Lu, Wei Chen & Zhifeng Hao
Peng Cheng Laboratory, Shenzhen, 518066, China
Ruichu Cai
College of Mathematics and Computer Science, Shantou University, Shantou, 515063, China
Zhifeng Hao

Contributions

Ruichu Cai: Conceptualization, Methodology, Investigation, Supervision, Project administration, Funding acquisition; Ruming Lu: Methodology, Software, Writing - Original Draft, Visualization; Wei Chen: Methodology, Formal analysis, Writing - Review & Editing, Funding acquisition; Zhifeng Hao: Investigation, Supervision, Funding acquisition.

Corresponding author

Correspondence to Wei Chen.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical approval

Not applicable.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cai, R., Lu, R., Chen, W. et al. Counterfactual contextual bandit for recommendation under delayed feedback. Neural Comput & Applic 36, 14599–14613 (2024). https://doi.org/10.1007/s00521-024-09800-0

Received: 03 May 2023
Accepted: 25 March 2024
Published: 09 May 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s00521-024-09800-0
