Abstract
In many decision-making problems, a well-designed reward function is required to guide agents toward desirable behavior. For example, an intelligent robot should check its power level before sweeping. Such a reward function depends on the history of states rather than on the current state alone; it is referred to as a non-Markovian reward. However, state-of-the-art MDP (Markov Decision Process) planners support only Markovian rewards. In this paper, we present an approach to transform non-Markovian rewards expressed in \({LTL}_{f}\) (Linear Temporal Logic over Finite Traces) into Markovian rewards. The \({LTL}_{f}\) formula is converted into an automaton, which is then compiled into a standard MDP model. The reward function of the resulting model is further optimized through reward shaping in order to speed up planning, and the reshaped reward function can be exploited by MDP planners to guide search. Finally, experiments on augmented International Probabilistic Planning Competition (IPPC) domains demonstrate the effectiveness and feasibility of our approach; in particular, the reshaped reward function significantly improves planner performance.
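To make the pipeline concrete, the following is a minimal Python sketch, not the authors' implementation: the state of a hand-built automaton for an \({LTL}_{f}\)-style property is paired with the MDP state, which makes the reward Markovian over the product, and potential-based reward shaping in the style of Ng et al. (1999) is applied on top. The DFA, its labels (check_power, sweep), and the potential values are illustrative assumptions.

```python
# Sketch (assumed names throughout): product-MDP reward + potential-based shaping.
GAMMA = 0.95

# Hypothetical DFA for a property like "eventually check power, then sweep":
# states 0 -> 1 -> 2, where 2 is accepting. Missing transitions self-loop.
DFA_TRANS = {
    (0, "check_power"): 1,
    (1, "sweep"): 2,
}
ACCEPTING = {2}

# Potential: automaton states closer to acceptance get higher value.
POTENTIAL = {0: 0.0, 1: 0.5, 2: 1.0}

def dfa_step(q, label):
    """Advance the automaton on the label of the current MDP state."""
    return DFA_TRANS.get((q, label), q)

def product_reward(q, q_next):
    """Markovian reward over the product state: pay only on first acceptance."""
    return 1.0 if q_next in ACCEPTING and q not in ACCEPTING else 0.0

def shaped_reward(q, q_next):
    """Potential-based shaping; it preserves the optimal policy (Ng et al.)."""
    return product_reward(q, q_next) + GAMMA * POTENTIAL[q_next] - POTENTIAL[q]

# Usage: a trace of state labels produced while the agent acts in the MDP.
q = 0
for label in ["noop", "check_power", "sweep"]:
    q_next = dfa_step(q, label)
    print(label, "->", round(shaped_reward(q, q_next), 3))
    q = q_next
```

In this sketch the unshaped reward is sparse (only acceptance pays), while the shaped reward also pays intermediate automaton progress, which is what lets a planner's search be guided before the goal is reached.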
This research is supported by the National Natural Science Foundation of China (61806158), the China Postdoctoral Science Foundation (2019T120881, 2018M643585), the Fundamental Research Funds for the Central Universities (XJS220304), and the Special Scientific Research Project of the Education Department of Shaanxi Province (21JK0844).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miao, R., Lu, X., Cui, J. (2023). An Approach of Transforming Non-Markovian Reward to Markovian Reward. In: Liu, S., Duan, Z., Liu, A. (eds) Structured Object-Oriented Formal Language and Method. SOFL+MSVL 2022. Lecture Notes in Computer Science, vol 13854. Springer, Cham. https://doi.org/10.1007/978-3-031-29476-1_2
Print ISBN: 978-3-031-29475-4
Online ISBN: 978-3-031-29476-1