Learning state-action correspondence across reinforcement learning control tasks via partially paired trajectories

Abstract

In many reinforcement learning (RL) tasks, the state-action space may change over time (e.g., an increased number of observable features, or a change in the representation of actions). Given these changes, the previously learnt policy will likely fail because its input and output features no longer match, and another policy must be trained from scratch, which is inefficient in terms of sample complexity. Recent work in transfer learning has made RL algorithms more efficient by incorporating knowledge from previous tasks, thus partially alleviating this problem. However, such methods typically require an explicit state-action correspondence from one task to the other. An autonomous agent may not have access to such high-level information, but it should be able to analyze its experience to identify similarities between tasks. In this paper, we propose a novel method for automatically learning a correspondence of states and actions from one task to another through an agent’s experience. In contrast to previous approaches, our method is based on two key insights: i) only the first state of the trajectories of the two tasks is paired, while the rest are unpaired and randomly collected, and ii) the transition model of the source task is used to predict the dynamics of the target task, thus aligning the unpaired states and actions. Additionally, this paper intentionally decouples the learning of the state-action correspondence from the transfer technique used, making it easy to combine with any transfer method. Our experiments demonstrate that our approach significantly accelerates transfer learning across a diverse set of problems, varying in state/action representation, physics parameters, and morphology, when compared to state-of-the-art algorithms that rely on cycle-consistency.
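
To make the second insight more concrete, the following is a minimal sketch of how a frozen source-task dynamics model could be used to align unpaired target transitions through learned state and action maps, with a handful of paired initial states providing anchoring supervision. All names, dimensions, network sizes, and the loss weighting are illustrative assumptions and do not correspond to the authors' implementation.

```python
# Illustrative sketch (not the paper's code): aligning unpaired target
# transitions with a frozen source-task dynamics model. Dimensions,
# network sizes, and the loss weighting are arbitrary assumptions.
import torch
import torch.nn as nn

S_SRC, A_SRC = 4, 1      # source state/action dimensions (assumed)
S_TGT, A_TGT = 6, 2      # target state/action dimensions (assumed)

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

phi_s = mlp(S_TGT, S_SRC)          # target state  -> source state
phi_a = mlp(A_TGT, A_SRC)          # target action -> source action
f_src = mlp(S_SRC + A_SRC, S_SRC)  # stand-in for the pretrained source dynamics model
for p in f_src.parameters():       # the source model stays frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(list(phi_s.parameters()) + list(phi_a.parameters()),
                       lr=1e-3)

# Random stand-ins for the data described in the abstract:
# unpaired target transitions (s, a, s') and a few paired initial states.
s, a, s_next = torch.randn(256, S_TGT), torch.randn(256, A_TGT), torch.randn(256, S_TGT)
s0_tgt, s0_src = torch.randn(32, S_TGT), torch.randn(32, S_SRC)

for step in range(1000):
    # Dynamics alignment: the mapped target transition should be
    # consistent with the source transition model.
    pred_next = f_src(torch.cat([phi_s(s), phi_a(a)], dim=-1))
    loss_dyn = ((pred_next - phi_s(s_next)) ** 2).mean()
    # Supervision from the (few) paired initial states.
    loss_pair = ((phi_s(s0_tgt) - s0_src) ** 2).mean()
    loss = loss_dyn + loss_pair
    opt.zero_grad()
    loss.backward()
    opt.step()
```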

Data Availability

The datasets generated and/or analyzed during the current study are available in the Git repository at https://github.com/fjaviergp/learning_correspondence_paper

Notes

  1. In RL, a trajectory refers to a sequence of states, actions, and rewards that an agent experiences while interacting with an environment over time.
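
     For concreteness, the snippet below collects one such trajectory from a toy, hypothetical environment; DummyEnv and the random behaviour policy are stand-ins used only to keep the example self-contained.

```python
# Hypothetical example of a trajectory as a list of (state, action, reward)
# tuples; DummyEnv is a stand-in, not an environment used in the paper.
import random

class DummyEnv:
    def reset(self):
        self.t = 0
        return 0.0                      # initial state
    def step(self, action):
        self.t += 1
        next_state = self.t + action    # toy dynamics
        reward = -abs(next_state)       # toy reward
        done = self.t >= 5
        return next_state, reward, done

env = DummyEnv()
state, trajectory, done = env.reset(), [], False
while not done:
    action = random.choice([-1, 1])     # random behaviour policy
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state
print(trajectory)  # e.g. [(0.0, 1, -2), (2, -1, -1), ...]
```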

Funding

This research was supported in part by AEI Grants PID2020-119367RB-I00 and PID2023-153341OB-I00.

Author information

Corresponding author

Correspondence to Javier García.

Ethics declarations

Conflicts of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A    Experiment setting

Table 3 displays the parameter settings used to train the source policy in each domain, spanning seven dimensions: the state space \(\mathcal S\), the action space \(\mathcal A\), the learning algorithm, the number of episodes \(H\), the maximum number of steps per episode \(K\), the learning rate \(\alpha \), and the discount factor \(\gamma \). The learning rate and the discount factor are set to \(\alpha =10^{-3}\) and \(\gamma =0.99\), respectively.

Table 3 Parameter settings for learning the source policy
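
A configuration along the following lines could capture the columns of Table 3; only \(\alpha \) and \(\gamma \) are taken from the text above, and every other value shown is a placeholder rather than a setting actually used in the paper.

```python
# Hypothetical per-domain training configuration mirroring the columns of
# Table 3. alpha and gamma come from the text; the remaining values are
# placeholders, not the settings actually used in the paper.
from dataclasses import dataclass

@dataclass
class SourcePolicyConfig:
    domain: str          # e.g. "Swimmer"
    state_dim: int       # |S|
    action_dim: int      # |A|
    algorithm: str       # learning algorithm (placeholder)
    episodes: int        # H
    max_steps: int       # K, maximum steps per episode
    alpha: float = 1e-3  # learning rate (from the text)
    gamma: float = 0.99  # discount factor (from the text)

# Placeholder instantiation (values are illustrative only).
cfg = SourcePolicyConfig(domain="Swimmer", state_dim=8, action_dim=2,
                         algorithm="DDPG", episodes=1000, max_steps=200)
```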

Table 4 presents the parameter settings employed for the forward dynamics model in each domain. This model takes the current state and action as input and predicts the next state. The number of samples used to learn the forward dynamics models is the same as that used to learn the source policies.
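
A minimal sketch of such a forward dynamics model is given below, assuming a small fully connected network trained by regression on (state, action, next state) samples; the dimensions and architecture are placeholders, as the actual settings are those reported in Table 4.

```python
# Minimal sketch of a forward dynamics model f(s, a) -> s', trained by
# regression on transition samples. Sizes and architecture are assumptions;
# the settings actually used are those reported in Table 4.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 1   # placeholder dimensions

model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, STATE_DIM),             # predicts the next state
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Random stand-ins for transitions collected in the source task.
states = torch.randn(1024, STATE_DIM)
actions = torch.randn(1024, ACTION_DIM)
next_states = torch.randn(1024, STATE_DIM)

for epoch in range(100):
    pred = model(torch.cat([states, actions], dim=-1))
    loss = loss_fn(pred, next_states)
    opt.zero_grad()
    loss.backward()
    opt.step()
```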

Finally, Table 5 shows the network architectures for the functions \(\phi _{s}\), \(\phi _{a}\), and \(\phi _{a}^{-}\). It also includes supplementary rows for each of the target tasks designed within the Swimmer domain; these rows are denoted Swimmer-X, with X representing the number of limbs under consideration.
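
Independently of the specific architectures in Table 5, the sketch below illustrates how the three mappings could fit together at transfer time: the target state is mapped through \(\phi _{s}\), the source policy acts in the source space, and the resulting action is mapped back with \(\phi _{a}^{-}\). All networks, dimensions, and the stand-in policy are placeholders, not the paper's implementation.

```python
# Illustrative use of the learned mappings at transfer time: the target
# state is mapped into the source state space, the source policy acts
# there, and the resulting action is mapped back to the target action
# space. All networks, dimensions, and the policy are placeholders.
import torch
import torch.nn as nn

S_SRC, A_SRC, S_TGT, A_TGT = 4, 1, 6, 2   # placeholder dimensions

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

phi_s = mlp(S_TGT, S_SRC)       # target state  -> source state
phi_a = mlp(A_TGT, A_SRC)       # target action -> source action (used when learning the correspondence)
phi_a_inv = mlp(A_SRC, A_TGT)   # source action -> target action (phi_a^-)
pi_src = mlp(S_SRC, A_SRC)      # stand-in for the pretrained source policy

def act_in_target(s_target: torch.Tensor) -> torch.Tensor:
    """Select a target-task action by reusing the source policy."""
    with torch.no_grad():
        s_source = phi_s(s_target)      # translate the observation
        a_source = pi_src(s_source)     # query the source policy
        return phi_a_inv(a_source)      # translate the action back

print(act_in_target(torch.randn(1, S_TGT)).shape)  # torch.Size([1, 2])
```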

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

García, J., Rañó, I., Burés, J.M. et al. Learning state-action correspondence across reinforcement learning control tasks via partially paired trajectories. Appl Intell 55, 219 (2025). https://doi.org/10.1007/s10489-024-06190-7

