
Parameters Efficient Fine-Tuning for Long-Tailed Sequential Recommendation

  • Conference paper
Artificial Intelligence (CICAI 2023)

Abstract

In an era of information explosion, recommendation systems play an important role in people's daily lives by facilitating content exploration. It is known that user activeness, i.e., the number of behaviors, tends to follow a long-tail distribution in which the majority of users have low activeness. In practice, we observe that tail users receive significantly lower-quality recommendations than head users after joint training. We further find that a model trained separately on tail users still achieves inferior results due to limited data. Although long-tail distributions are ubiquitous in recommendation systems, improving recommendation performance for tail users remains challenging in both research and industry. Directly applying existing long-tail methods risks hurting the experience of head users, which is hardly affordable since the small portion of highly active head users contributes a considerable portion of platform revenue. In this paper, we propose a novel approach that significantly improves recommendation performance for tail users while achieving at least comparable performance for head users relative to the base model. The essence of the approach is a novel Gradient Aggregation technique that learns the common knowledge shared by all users into a backbone model, followed by separate plugin prediction networks for head-user and tail-user personalization. For common knowledge learning, we leverage the backdoor adjustment from causality theory to deconfound the gradient estimation, thereby shielding the backbone training from the confounder, i.e., user activeness. We conduct extensive experiments on two public recommendation benchmark datasets and a large-scale industrial dataset collected from the Alipay platform. Empirical studies validate the rationality and effectiveness of our approach.

Z. Lv and F. Wang contributed equally to this study.

This work was supported in part by National Natural Science Foundation of China (62006207, 62037001, U20A20387), Young Elite Scientists Sponsorship Program by CAST (2021QNRC001), Zhejiang Province Natural Science Foundation (LQ21F020020), Project by Shanghai AI Laboratory (P22KS00111), Program of Zhejiang Province Science and Technology (2022C01044), the StarryNight Science Fund of Zhejiang University Shanghai Institute for Advanced Study (SN-ZJU-SIAS-0010), and the Fundamental Research Funds for the Central Universities (226-2022-00142, 226-2022-00051).


Notes

  1. http://grouplens.org/datasets/movielens/.
  2. http://jmcauley.ucsd.edu/data/amazon/.
  3. https://www.alipay.com/.

References

  1. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: International Conference on Machine Learning, pp. 794–803. PMLR (2018)

  2. Dong, M., Yuan, F., Yao, L., Xu, X., Zhu, L.: Mamo: memory-augmented meta-optimization for cold-start recommendation. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 688–697 (2020)

  3. Glymour, M., Pearl, J., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley, Hoboken (2016)

  4. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. In: International Conference on Learning Representations 2016 (2016)

  5. Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384 (2016)

  6. Huang, R., et al.: Audiogpt: understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995 (2023)

  7. Kang, W.C., McAuley, J.: Self-attentive sequential recommendation. In: 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206. IEEE (2018)

  8. Krichene, W., Rendle, S.: On sampled metrics for item recommendation. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1748–1757 (2020)

  9. Lee, H., Im, J., Jang, S., Cho, H., Chung, S.: MeLU: meta-learned user preference estimator for cold-start recommendation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1073–1082 (2019)

  10. Li, M., et al.: Winner: weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23090–23099 (2023)

  11. Li, M., et al.: End-to-end modeling via information tree for one-shot natural language spatial video grounding. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8707–8717 (2022)

  12. Lv, Z., et al.: Ideal: toward high-efficiency device-cloud collaborative and dynamic recommendation system. arXiv preprint arXiv:2302.07335 (2023)

  13. Lv, Z., et al.: Duet: a tuning-free device-cloud collaborative parameters generation framework for efficient device model generalization. In: Proceedings of the ACM Web Conference 2023 (2023)

  14. Mansilla, L., Echeveste, R., Milone, D.H., Ferrante, E.: Domain generalization via gradient surgery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6630–6638 (2021)

  15. McAuley, J.J., Targett, C., Shi, Q., Hengel, A.V.D.: Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015 (2015)

  16. Neuberg, L.G.: Causality: models, reasoning, and inference, by Judea Pearl, Cambridge University Press, 2000. Econom. Theory 19(4), 675–685 (2003)

  17. Ouyang, W., Wang, X., Zhang, C., Yang, X.: Factors in finetuning deep model for object detection with long-tail distribution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 864–873 (2016)

  18. Pan, F., Li, S., Ao, X., Tang, P., He, Q.: Warm up cold-start advertisements: improving CTR predictions via learning to learn id embeddings. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 695–704 (2019)

  19. Pearl, J.: Causal diagrams for empirical research. Biometrika 82(4), 669–688 (1995)

  20. Pearl, J.: Causality. Cambridge University Press, Cambridge (2009)

  21. Tong, Y., et al.: Quantitatively measuring and contrastively exploring heterogeneity for domain generalization. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2023)

  22. Wang, Y.X., Ramanan, D., Hebert, M.: Learning to model the tail. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 7032–7042 (2017)

  23. Wang, Z., Tsvetkov, Y., Firat, O., Cao, Y.: Gradient vaccine: investigating and improving multi-task optimization in massively multilingual models. arXiv preprint arXiv:2010.05874 (2020)

  24. Yin, J., Liu, C., Wang, W., Sun, J., Hoi, S.C.: Learning transferrable parameters for long-tailed sequential user behavior modeling. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 359–367 (2020)

  25. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782 (2020)

  26. Zhang, S., Yao, D., Zhao, Z., Chua, T., Wu, F.: Causerec: counterfactual user sequence synthesis for sequential recommendation. In: SIGIR 2021: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021, pp. 367–377. ACM (2021)

  27. Zhang, Y., Kang, B., Hooi, B., Yan, S., Feng, J.: Deep long-tailed learning: a survey. arXiv preprint arXiv:2110.04596 (2021)

  28. Zhang, Y., et al.: Online adaptive asymmetric active learning for budgeted imbalanced data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2768–2777 (2018)

  29. Zhang, Z., Pfister, T.: Learning fast sample re-weighting without reward data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 725–734 (2021)

  30. Zhou, G., et al.: Deep interest network for click-through rate prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068 (2018)

  31. Zhu, D., et al.: Bridging the gap: neural collapse inspired prompt tuning for generalization under class imbalance. arXiv preprint arXiv:2306.15955 (2023)


Author information

Correspondence to Shengyu Zhang, Kun Kuang or Fei Wu.

Appendices

A Pseudo Code

The pseudo code of our proposed method is summarized in Algorithm 1.

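Algorithm 1 is rendered as an image in the published version and does not survive text extraction. As a rough substitute, below is a minimal PyTorch sketch of the two-stage procedure the paper describes, assuming a GRU backbone and linear prediction heads; the names (Backbone, backbone_step, plugin_step) and the fixed prior weights are illustrative choices of ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone(nn.Module):
    """Shared sequence encoder; a stand-in for the GRU4Rec/SASRec-style
    backbones used in the paper."""
    def __init__(self, n_items=1000, dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, seq):                  # seq: (batch, seq_len) item ids
        out, _ = self.gru(self.emb(seq))
        return out[:, -1]                    # last hidden state as the user state

def backbone_step(backbone, head, group_batches, group_priors, opt):
    """Stage 1 (gradient aggregation): compute each activeness group's
    gradient separately, then combine the gradients under fixed prior
    weights rather than empirical group frequencies, so batches from head
    users cannot dominate the shared backbone."""
    params = list(backbone.parameters()) + list(head.parameters())
    agg = [torch.zeros_like(p) for p in params]
    for (seq, target), prior in zip(group_batches, group_priors):
        loss = F.cross_entropy(head(backbone(seq)), target)
        for a, g in zip(agg, torch.autograd.grad(loss, params)):
            a.add_(prior * g)
    opt.zero_grad()
    for p, a in zip(params, agg):
        p.grad = a                           # install the aggregated gradient
    opt.step()

def plugin_step(backbone, plugin, batch, opt):
    """Stage 2: the backbone is frozen; only the per-group plugin head
    learns group-specific personalization."""
    seq, target = batch
    with torch.no_grad():
        state = backbone(seq)                # shared knowledge, not updated
    loss = F.cross_entropy(plugin(state), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Learning rates follow Appendix B.1 (1e-3 for stage 1, 1e-4 for stage 2).
backbone, head, plugin = Backbone(), nn.Linear(32, 1000), nn.Linear(32, 1000)
opt1 = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
opt2 = torch.optim.Adam(plugin.parameters(), lr=1e-4)
batches = [(torch.randint(0, 1000, (8, 5)), torch.randint(0, 1000, (8,)))
           for _ in range(3)]                # one synthetic batch per user group
backbone_step(backbone, head, batches, group_priors=[0.2, 0.3, 0.5], opt=opt1)
plugin_step(backbone, plugin, batches[0], opt=opt2)
```

Weighting per-group gradients by fixed priors instead of letting empirical group frequencies weight them implicitly is the gradient-level effect of the backdoor adjustment described in the paper; the paper's exact weighting scheme may differ from this sketch.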

B Experiments

B.1 Experiment Settings

Datasets

Movielens (Note 1). MovieLens is a widely used public benchmark of movie ratings. In our experiments, we use MovieLens-1M, which contains one million samples.

Amazon (Note 2). The Amazon Review dataset [15] is a widely known recommendation benchmark. We use the Amazon-Books subset for evaluation.

Alipay. We collect a large-scale industrial dataset for online evaluation from the Alipay platform (Note 3). Applets, such as the mobile recharge service, are treated as items. For each user, clicked applets are treated as positives and the other applets exposed to the user as negatives.

The detailed statistics of these datasets are summarized in Table 3.

Table 3. Statistics of the evaluation datasets.

Evaluation Metrics

$$\begin{aligned} \mathrm{AUC} &= \frac{\sum_{x_0\in \mathcal{D}_T} \sum_{x_1 \in \mathcal{D}_F}\mathbb{1}[f(x_1)<f(x_0)]}{|\mathcal{D}_T|\,|\mathcal{D}_F|}, \\ \mathrm{HitRate}@K &= \frac{1}{|\mathcal{U}|}\sum_{u\in \mathcal{U}} \mathbb{1}(R_{u,g_u}\le K), \end{aligned}$$
(14)

where \(\mathbb{1}(\cdot)\) is the indicator function, f is the model being evaluated, \(R_{u,g_u}\) is the rank the model assigns to the ground-truth item \(g_u\) of user u, and \(\mathcal{D}_T\) and \(\mathcal{D}_F\) are the positive and negative test sample sets, respectively.
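For concreteness, the two metrics can be transcribed directly into numpy; the helper names and input conventions below (score arrays, 1-based ranks) are our assumptions.

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Pairwise AUC from Eq. (14): the fraction of (positive, negative)
    pairs that the model scores in the correct order."""
    pos = np.asarray(pos_scores)[:, None]   # shape (|D_T|, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, |D_F|)
    return (neg < pos).mean()

def hit_rate_at_k(ranks, k):
    """HitRate@K from Eq. (14): the share of users whose ground-truth item
    is ranked within the top K; `ranks` holds the 1-based ranks R_{u,g_u}."""
    return (np.asarray(ranks) <= k).mean()

# Three users whose ground-truth items were ranked 1, 4 and 12:
print(hit_rate_at_k([1, 4, 12], k=5))       # 2 of 3 users hit -> 0.667
print(auc([0.9, 0.8], [0.7, 0.85]))         # 3 of 4 pairs ordered correctly -> 0.75
```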

Baselines

GRU4Rec [4] is one of the earliest works to apply recurrent neural networks to user behavior sequence modeling in recommendation.

DIN [30] introduces a target-attention mechanism that aggregates historically interacted items for click-through-rate prediction.

SASRec [7] is a representative sequential modeling method based on self-attention. By masking backward connections in the attention map, it predicts the next item at every position of the sequence simultaneously.

To evaluate effectiveness on tail-user modeling, the following competing methods are included for comparison.

Agr-Rand [14] applies a gradient surgery strategy to domain generalization: inter-domain gradients are coordinated so that network weights are updated only along directions the domains agree on, yielding a more robust image classifier.

PCGrad [25] is a classic gradient surgery method for mitigating gradient conflicts: whenever the gradients of two tasks have negative cosine similarity, it projects one task's gradient onto the normal plane of the other's, removing the conflicting component (a minimal sketch is given after the baseline descriptions).

Grad-Transfer [24] addresses the long-tail problem by reweighting users during training through resampling and gradient alignment, and uses adversarial learning to prevent the model from exploiting sensitive user-activeness group information in its predictions.
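For reference, the projection rule used by the gradient surgery baselines can be sketched in a few lines of numpy; this is an illustrative rendering of PCGrad on flattened per-task gradient vectors, not the authors' implementation.

```python
import random
import numpy as np

def pcgrad(task_grads):
    """PCGrad-style surgery: for each task, visit the other tasks in random
    order and, whenever the gradients conflict (negative dot product),
    project out the conflicting component; finally sum the adjusted grads."""
    adjusted = []
    for i, g in enumerate(task_grads):
        g = g.astype(float)
        others = [j for j in range(len(task_grads)) if j != i]
        random.shuffle(others)               # PCGrad visits other tasks randomly
        for j in others:
            dot = g @ task_grads[j]
            if dot < 0:                      # conflicting directions
                g -= dot / (task_grads[j] @ task_grads[j]) * task_grads[j]
        adjusted.append(g)
    return np.sum(adjusted, axis=0)

# Two conflicting gradients: the x components cancel either way, but after
# surgery the agreed-upon y direction is amplified (sum = [0.0, 1.6] vs [0.0, 1.0]).
g1, g2 = np.array([1.0, 0.5]), np.array([-1.0, 0.5])
print(pcgrad([g1, g2]))
```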

Implementation Details

Preprocessing. On the Alipay dataset, all samples fall between 2021-05-19 and 2021-07-10. To simulate a real A/B testing environment, we split the data by date: samples before 0:00 AM on 2021-07-01 form the training set and the remaining samples form the test set. On the Movielens and Amazon datasets, we label all observed user-item pairs as 1 and unobserved pairs as 0, and hold out each user's last sample for testing. On Movielens, training negatives are sampled at a positive-to-negative ratio of 1:4; for the test set, following [8], we rank against all of a user's negative samples. On Amazon, the training set uses a positive-to-negative ratio of 1:4 and the test set a ratio of 1:99; we also filter out users and items with fewer than 15 clicks to reduce the dataset size. On the Alipay and Amazon datasets, users are grouped by their number of samples; on the Movielens dataset, users are grouped by the length of their click sequences.
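The split and sampling rules above amount to a short pipeline; in the sketch below, the timestamp column name `ts`, the helper names, and the integer catalogue encoding are assumptions, not the paper's actual preprocessing code.

```python
import numpy as np
import pandas as pd

def temporal_split(logs, cutoff="2021-07-01"):
    """Alipay-style split: samples before the cutoff date train, the rest
    test. The timestamp column name `ts` is an assumed schema."""
    cut = pd.Timestamp(cutoff)
    return logs[logs["ts"] < cut], logs[logs["ts"] >= cut]

def sample_negatives(pos_items, n_items, ratio=4, rng=None):
    """Movielens/Amazon-style training negatives: for each positive, draw
    `ratio` items the user never interacted with (these are labelled 0)."""
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = np.setdiff1d(np.arange(n_items), pos_items)
    n = min(len(candidates), ratio * len(pos_items))
    return rng.choice(candidates, size=n, replace=False)

# A user who clicked items 3 and 7 in a 20-item catalogue receives
# 8 sampled training negatives (the 1:4 ratio).
print(sample_negatives(np.array([3, 7]), n_items=20))
```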

Implementation. Our models are trained on workstations equipped with NVIDIA Tesla V100 GPUs. For all datasets and models, the batch size is set to 512. The loss is optimized with the Adam optimizer, using a learning rate of 0.001 for the gradient aggregation stage and 0.0001 for the plugin model learning stage. Training stops when the loss converges on the validation set.

Fig. 4. Performance of Hit@1 on the validation set across training epochs.

B.2 Results

In Fig. 4, a larger group number indicates more active users: group 1 is the least active group and group 5 the most active. For the relatively inactive groups (groups 1 and 2), the plugin network reaches its optimum within only a few epochs, whereas for the more active groups (groups 3, 4 and 5) the training curve rises more steadily as epochs increase. This is mainly due to the differing amounts of personalized information across activeness groups: a group with more data carries more personalized information and needs more epochs to learn it, while the others require only a few epochs.

Toy Example. In Fig. 5, we give a toy example of the long-tail effect drawn from a real case on the Alipay platform and show the improvement brought by our proposed method. There are two groups of women: young women with high activeness and middle-aged women with low activeness. They share common preferences, such as clothes and shoes, but also differ in others. Owing to the long-tail effect, the preferences of the low-activeness group are difficult to capture, so the model falls back to recommending popular products to them. To address this, we extract generalizable knowledge via the gradient aggregation module, enabling the model to recommend the common preferences, such as clothes and shoes, to low-activeness women, although sometimes not in their favorite styles. Since the recommendations for the two groups become more similar, performance on the high-activeness group decreases. We then train a plugin network for each group; it captures group-specific personalization such as the preferred styles of clothes and shoes and other less popular preferences.

Fig. 5. A toy example of the long-tail effect and the improvement brought by our method.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Lv, Z., Wang, F., Zhang, S., Zhang, W., Kuang, K., Wu, F. (2024). Parameters Efficient Fine-Tuning for Long-Tailed Sequential Recommendation. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science(), vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_36


  • DOI: https://doi.org/10.1007/978-981-99-8850-1_36


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science, Computer Science (R0)
