Abstract
In many decision-making problems, a well-designed reward function is required to guide agents toward desirable behavior. For example, an intelligent robot should check its power level before sweeping. Such a reward function depends on the history of states rather than on the current state alone; it is referred to as a non-Markovian reward. However, state-of-the-art MDP (Markov Decision Process) planners support only Markovian rewards. In this paper, we present an approach to transform non-Markovian rewards expressed in \({LTL}_{f}\) (Linear Temporal Logic over Finite Traces) into Markovian rewards. The \({LTL}_{f}\) formula is converted into an automaton, which is then compiled into a standard MDP model. The reward function of the resulting model is further optimized through reward shaping in order to speed up planning, and the reshaped reward function can be exploited by MDP planners to guide search. Finally, experiments on augmented International Probabilistic Planning Competition (IPPC) domains demonstrate the effectiveness and feasibility of our approach; in particular, the reshaped reward function significantly improves planner performance.
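To make the pipeline concrete, the following is a minimal Python sketch, not the authors' implementation: the state of a hand-built automaton for an \({LTL}_{f}\)-style property is paired with the MDP state, which makes the reward Markovian over the product, and potential-based reward shaping in the style of Ng et al. (1999) is applied on top. The DFA, its labels (check_power, sweep), and the potential values are illustrative assumptions.

```python
# Sketch (assumed names throughout): product-MDP reward + potential-based shaping.
GAMMA = 0.95

# Hypothetical DFA for a property like "eventually check power, then sweep":
# states 0 -> 1 -> 2, where 2 is accepting. Missing transitions self-loop.
DFA_TRANS = {
    (0, "check_power"): 1,
    (1, "sweep"): 2,
}
ACCEPTING = {2}

# Potential: automaton states closer to acceptance get higher value.
POTENTIAL = {0: 0.0, 1: 0.5, 2: 1.0}

def dfa_step(q, label):
    """Advance the automaton on the label of the current MDP state."""
    return DFA_TRANS.get((q, label), q)

def product_reward(q, q_next):
    """Markovian reward over the product state: pay only on first acceptance."""
    return 1.0 if q_next in ACCEPTING and q not in ACCEPTING else 0.0

def shaped_reward(q, q_next):
    """Potential-based shaping; it preserves the optimal policy (Ng et al.)."""
    return product_reward(q, q_next) + GAMMA * POTENTIAL[q_next] - POTENTIAL[q]

# Usage: a trace of state labels produced while the agent acts in the MDP.
q = 0
for label in ["noop", "check_power", "sweep"]:
    q_next = dfa_step(q, label)
    print(label, "->", round(shaped_reward(q, q_next), 3))
    q = q_next
```

In this sketch the unshaped reward is sparse (only acceptance pays), while the shaped reward also pays intermediate automaton progress, which is what lets a planner's search be guided before the goal is reached.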
This research is supported by the National Natural Science Foundation of China (61806158), the China Postdoctoral Science Foundation (2019T120881, 2018M643585), the Fundamental Research Funds for the Central Universities (XJS220304), and the Special Scientific Research Project of the Education Department of Shaanxi Province (21JK0844).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miao, R., Lu, X., Cui, J. (2023). An Approach of Transforming Non-Markovian Reward to Markovian Reward. In: Liu, S., Duan, Z., Liu, A. (eds) Structured Object-Oriented Formal Language and Method. SOFL+MSVL 2022. Lecture Notes in Computer Science, vol 13854. Springer, Cham. https://doi.org/10.1007/978-3-031-29476-1_2
Print ISBN: 978-3-031-29475-4
Online ISBN: 978-3-031-29476-1