ABSTRACT
The hardware and energy costs of training a click-through rate (CTR) model are prohibitively high. A recent and promising direction for reducing these costs is data distillation with gradient matching, which synthesizes a small distilled dataset that guides the model toward a parameter space similar to the one reached by training on real data. However, applying this method in the recommendation field faces two main challenges: (1) categorical recommendation data are high-dimensional, sparse one- or multi-hot vectors that block gradient flow, rendering backpropagation-based data distillation invalid; (2) gradient-matching-based data distillation is computationally expensive because of its bi-level optimization. To this end, we investigate efficient data distillation tailored to recommendation data rich in side information, reformulating the discrete data into a dense, continuous format. We further introduce a one-step gradient matching scheme that performs gradient matching for only a single step, avoiding the inefficient inner training loop. The overall method, Categorical data distillation with Gradient Matching (CGM), distills a large dataset into a small set of informative synthetic data for training CTR models from scratch. Experimental results show that CGM not only outperforms state-of-the-art coreset selection and data distillation methods but also exhibits remarkable cross-architecture performance. Moreover, we explore applying CGM to model retraining and to mitigating the effect of different random seeds on training results.
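To make the two ideas above concrete, here is a minimal, self-contained sketch (not the authors' implementation) of one-step gradient matching over a continuous relaxation of categorical CTR data, assuming PyTorch. All names (`CTRModel`, `field_dims`, `syn_logits`) and toy sizes are illustrative assumptions, not from the paper.

```python
# Hedged sketch: one-step gradient matching for categorical data distillation.
# Synthetic samples are learnable per-field logits, softmaxed into dense
# "soft one-hots" so gradients can flow back into the data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

class CTRModel(torch.nn.Module):
    """Toy CTR model: per-field linear "embedding" over dense field vectors + MLP head."""
    def __init__(self, field_dims, embed_dim=8):
        super().__init__()
        # Linear over a one-hot vector is an embedding lookup, but differentiable
        # with respect to the (relaxed) input.
        self.embeds = torch.nn.ModuleList(
            [torch.nn.Linear(d, embed_dim, bias=False) for d in field_dims])
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(embed_dim * len(field_dims), 16),
            torch.nn.ReLU(),
            torch.nn.Linear(16, 1))

    def forward(self, fields):
        # fields: list of (batch, field_dim) dense probability vectors.
        x = torch.cat([emb(f) for emb, f in zip(self.embeds, fields)], dim=-1)
        return self.mlp(x).squeeze(-1)

field_dims = [10, 20]          # two categorical fields (toy cardinalities)
n_real, n_syn = 256, 16

# Real data: one-hot fields + binary click labels.
real_fields = [F.one_hot(torch.randint(0, d, (n_real,)), d).float() for d in field_dims]
real_y = torch.randint(0, 2, (n_real,)).float()

# Synthetic data: learnable logits per field (multi-hot fields could use
# a sigmoid relaxation instead of softmax).
syn_logits = [torch.randn(n_syn, d, requires_grad=True) for d in field_dims]
syn_y = torch.randint(0, 2, (n_syn,)).float()
opt = torch.optim.Adam(syn_logits, lr=0.01)

for step in range(100):
    # One-step scheme: sample a fresh random initialization each outer step
    # instead of unrolling inner training, sidestepping bi-level optimization.
    model = CTRModel(field_dims)
    params = list(model.parameters())

    g_real = torch.autograd.grad(
        F.binary_cross_entropy_with_logits(model(real_fields), real_y), params)
    syn_fields = [F.softmax(l, dim=-1) for l in syn_logits]
    g_syn = torch.autograd.grad(
        F.binary_cross_entropy_with_logits(model(syn_fields), syn_y),
        params, create_graph=True)

    # Match gradients via cosine distance, summed over parameter tensors.
    loss = sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
               for a, b in zip(g_real, g_syn))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design choice this sketch illustrates is replacing the expensive unrolled inner training loop of standard gradient matching with a single matching step at a freshly sampled initialization, so the outer loop only ever differentiates through one forward/backward pass.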