Mastering table tennis with hierarchy: a reinforcement learning approach with progressive self-play training


Abstract

Hierarchical reinforcement learning (HRL) is widely applied in complex task scenarios. In tasks where simple model-free reinforcement learning struggles, a hierarchical design makes more efficient use of interaction data, significantly reducing training costs and improving training success rates. This study investigates HRL built on model-free lower-level policies to learn complex strategies for a robotic arm playing table tennis. Through a progression of pre-training, self-play training, and self-play training guided by top-level winning strategies, the robustness of the lower-level hitting strategies is enhanced. Furthermore, a novel decay reward mechanism is employed in training the higher-level agent to improve the win rate in adversarial matches against other methods. After pre-training and adversarial training, the forehand strategy achieved an average of 52 rally cycles and the backhand strategy 48 rally cycles in testing. The high-level strategy trained with the decay reward mechanism achieved an advantageous score when competing against other strategies.
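To make the decay reward mechanism concrete, the following is a minimal sketch of one plausible formulation, in which the terminal win/loss reward decays exponentially with the number of strokes taken to decide the point, so the high-level agent is pushed toward strategies that win quickly. The function name, decay constant, and signature are illustrative assumptions, not the paper's exact definition.

```python
def decayed_outcome_reward(win: bool, strokes: int,
                           base: float = 1.0, decay: float = 0.95) -> float:
    """Terminal reward for the high-level agent that shrinks as the point
    takes more strokes to decide. The per-stroke factor `decay` is a
    hypothetical value; the paper's exact schedule may differ."""
    magnitude = base * decay ** strokes
    return magnitude if win else -magnitude

# Winning quickly is worth more than winning a long point:
print(decayed_outcome_reward(True, 3))   # ~0.857
print(decayed_outcome_reward(True, 10))  # ~0.599
```

Under this kind of shaping, a loss is penalized symmetrically, so the high-level policy has an incentive both to end favorable points early and to prolong unfavorable ones.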




Data Availability

Data will be made available on request; requests should be directed to the corresponding author.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (GA 61876054).

Author information

Corresponding author

Correspondence to Qiang Wang.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 7294 KB)

Appendices

Appendix A: Convergence of pre-training rewards

Fig. 9  Convergence of pre-training rewards for the forehand strategy, which converges after approximately 6,000 episodes

Fig. 10  Convergence of pre-training rewards for the backhand strategy, which converges after approximately 5,500 episodes

Appendix B: Convergence failure of reward components during pre-training using the DDPG method

Fig. 11  Convergence failure of reward components during pre-training with the DDPG method. Pre-training converges to the action boundaries within 200-300 iterations but fails to converge to the optimal target
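The boundary collapse described in the caption above can be detected during training by monitoring how many action components sit at the limits of the action space. Below is a minimal sketch, assuming actions normalized to [-1, 1]; the function name, threshold, and dummy data are illustrative, not from the paper.

```python
import numpy as np

def boundary_saturation(actions: np.ndarray, tol: float = 0.01) -> float:
    """Fraction of action components pinned near the limits of a
    [-1, 1]-normalized action space; values near 1.0 suggest the policy
    has collapsed onto the bounds rather than a useful stroke."""
    return float(np.mean(np.abs(actions) >= 1.0 - tol))

# Example: a batch of recent policy outputs, shape (batch, action_dim).
recent = np.clip(np.random.randn(256, 6) * 2.0, -1.0, 1.0)  # dummy data
if boundary_saturation(recent) > 0.9:
    print("Warning: actions saturated at bounds (DDPG-style collapse)")
```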

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, H., Fan, J., Xu, H. et al. Mastering table tennis with hierarchy: a reinforcement learning approach with progressive self-play training. Appl Intell 55, 562 (2025). https://doi.org/10.1007/s10489-025-06450-0

