
Off-Policy Differentiable Logic Reinforcement Learning

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12976)

Abstract

In this paper, we propose an Off-Policy Differentiable Logic Reinforcement Learning (OPDLRL) framework that inherits the interpretability and generalization ability of Differentiable Inductive Logic Programming (DILP) while resolving its weaknesses in execution efficiency, stability, and scalability. The key contributions include the use of approximate inference to significantly reduce the number of logic rules in the deduction process, an off-policy training method that enables approximate inference, and a distributed and hierarchical training framework. Extensive experiments, in particular playing real-time video games in Rabbids against human players, show that OPDLRL performs as well as or better than other DILP-based methods while being far more practical in terms of sample efficiency and execution efficiency, making it applicable to complex and (near) real-time domains.
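To make the approximate-inference idea concrete, the sketch below illustrates one way a "drop rate" (see the Notes) could prune low-weighted candidate rules before a differentiable deduction step. This is an illustrative approximation under stated assumptions, not the authors' implementation or Eq. (5); the names approximate_deduction, rule_fns, rule_weights, valuations, and drop_rate are placeholders introduced here for illustration.

```python
import numpy as np

def approximate_deduction(valuations, rule_fns, rule_weights, drop_rate=0.9):
    """One step of differentiable forward chaining in which only the
    top-weighted (1 - drop_rate) fraction of candidate rules is applied.
    Hypothetical sketch of the paper's drop-rate notion, not the authors' code."""
    # Number of rules to keep after pruning (at least one).
    k = max(1, int(round((1.0 - drop_rate) * len(rule_fns))))
    keep = np.argsort(rule_weights)[-k:]          # indices of surviving rules

    new_vals = valuations.copy()
    for i in keep:
        # Soft-or: combine each kept rule's weighted deduction with the
        # current valuation vector of ground atoms.
        new_vals = np.maximum(new_vals, rule_weights[i] * rule_fns[i](valuations))
    return new_vals

# Toy usage with two hypothetical rules over a 4-atom valuation vector.
rules = [lambda v: np.roll(v, 1), lambda v: v ** 2]
weights = np.array([0.8, 0.1])
print(approximate_deduction(np.array([1.0, 0.0, 0.5, 0.0]), rules, weights, drop_rate=0.5))
```

With a drop rate of 0.5, only the higher-weighted of the two toy rules contributes to the updated valuations, which is the kind of saving the paper attributes to approximate inference when the full rule set is large.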


Notes

  1. https://en.wikipedia.org/wiki/Raving_Rabbids.

  2. The comparison of different parameterization methods can be found in the Supplementary Material.

  3. We used the same setting of hyper-parameters for all tasks.

  4. Drop rate represents the percentage of rules ignored in the approximate inference; see Eq. (5).



Acknowledgement

This work is partially supported by NSFC under Grants U19B2020 and 61772074 and by the National Key R&D Program of China under Grant 2017YFB0803300.

Author information


Corresponding author

Correspondence to Xin Li.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, L., Li, X., Wang, M., Tian, A. (2021). Off-Policy Differentiable Logic Reinforcement Learning. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds.) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science (LNAI), vol. 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_38

  • DOI: https://doi.org/10.1007/978-3-030-86520-7_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86519-1

  • Online ISBN: 978-3-030-86520-7

  • eBook Packages: Computer Science, Computer Science (R0)
