Efficient policy evaluation by matrix sketching

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

In reinforcement learning, policy evaluation aims to predict the long-term value of a state under a given policy. As high-dimensional representations become increasingly common in reinforcement learning, reducing the computational cost of policy evaluation has become a significant problem. Many recent works adopt matrix sketching methods to accelerate least-squares temporal difference (TD) algorithms and quasi-Newton temporal difference algorithms. Among these sketching methods, truncated incremental SVD shows better performance because it is stable and efficient. However, the convergence properties of incremental SVD remain open. In this paper, we first show that conventional incremental SVD algorithms can incur enormous approximation errors in the worst case. We then propose a variant of incremental SVD with better theoretical guarantees, which periodically shrinks the singular values. Moreover, we employ our improved incremental SVD to accelerate least-squares TD and quasi-Newton TD algorithms. Experimental results verify the correctness and effectiveness of our methods.
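The periodic-shrinking idea can be illustrated with a Frequent-Directions-style sketch, a related deterministic sketching method; this is a minimal sketch of the general technique, not the paper's exact algorithm, and the function name and parameters below are illustrative only. The buffer holds 2k rows; whenever it fills, an SVD is taken and every squared singular value is shrunk by the k-th largest one, freeing at least k rows.

```python
import numpy as np

def fd_sketch(rows, k):
    """Frequent-Directions-style sketch of a stream of rows.

    Maintains a 2k x d buffer B; whenever it fills, the singular
    values of B are shrunk, which frees at least k rows.  The
    covariance error ||A^T A - B^T B||_2 is bounded by ||A||_F^2 / k.
    """
    rows = np.asarray(rows, dtype=float)
    _, d = rows.shape
    B = np.zeros((2 * k, d))
    nxt = 0  # index of the next free row in the buffer
    for a in rows:
        if nxt == 2 * k:
            # Periodic shrinking step: damp every squared singular
            # value by the k-th largest; the smaller half become zero.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[k - 1] ** 2, 0.0))
            B = s[:, None] * Vt
            nxt = k
        B[nxt] = a
        nxt += 1
    return B
```

The shrinking step is what makes the sketch deterministic and stable: unlike naive truncation, the subtracted mass is accounted for in the error bound, which is the flavor of guarantee the abstract refers to.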



Acknowledgements

The corresponding author Weinan Zhang was supported by the “New Generation of AI 2030” Major Project (2018AAA0100900) and the National Natural Science Foundation of China (Grant Nos. 62076161, 61772333, 61632017).

Author information

Corresponding author

Correspondence to Weinan Zhang.

Additional information

Cheng Chen is currently a PhD candidate in the APEX Lab at Shanghai Jiao Tong University, China. He received his bachelor's degree from the Department of Computer Science at Shanghai Jiao Tong University, China in 2013. His research interests lie in matrix approximation, online learning, and optimization.

Weinan Zhang received his PhD degree from University College London in 2016 and his BS degree from the ACM Class of Shanghai Jiao Tong University, China in 2011. He is currently an assistant professor with the Department of Computer Science, Shanghai Jiao Tong University. He has published over 50 research papers in conferences and journals, including KDD, SIGIR, AAAI, WWW, WSDM, ICDM, JMLR, and IPM. His research interests include machine learning and big data mining, particularly deep learning and reinforcement learning techniques for real-world data mining scenarios such as computational advertising, recommendation systems, text mining, Web search, and knowledge graphs.

Yong Yu received his MS degree from the CS Department, East China Normal University, China. He is currently a professor with the Department of Computer Science, Shanghai Jiao Tong University, China and the Director of the Apex Data & Knowledge Management Lab. As the principal investigator, he has led several National Natural Science Foundation of China and China National High Tech (863) Program projects. His research interests include Web search, semantic search, data mining, and machine learning. He has published over 200 papers and has served as a PC member of several conferences, including WWW, RecSys, and a dozen other related conferences such as NIPS, ICML, SIGIR, and ISWC.


About this article

Cite this article

Chen, C., Zhang, W. & Yu, Y. Efficient policy evaluation by matrix sketching. Front. Comput. Sci. 16, 165330 (2022). https://doi.org/10.1007/s11704-021-0354-4
