In reinforcement learning, policy evaluation aims to predict the long-term value of a state under a given policy. As high-dimensional representations become increasingly common in reinforcement learning, reducing the computational cost has become a significant problem for policy evaluation. Many recent works adopt matrix sketching methods to accelerate least-squares temporal difference (TD) algorithms and quasi-Newton temporal difference algorithms. Among these sketching methods, the truncated incremental SVD shows better performance because it is stable and efficient. However, the convergence properties of the incremental SVD remain an open question. In this paper, we first show that conventional incremental SVD algorithms can have enormous approximation errors in the worst case. We then propose a variant of incremental SVD with better theoretical guarantees, obtained by shrinking the singular values periodically. Moreover, we employ our improved incremental SVD to accelerate least-squares TD and quasi-Newton TD algorithms. The experimental results verify the correctness and effectiveness of our methods.
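To make the idea of periodically shrinking singular values concrete, the following minimal sketch implements the well-known Frequent Directions procedure (Liberty, 2013), which the paper's reference list also cites. It is an illustration only and is not the variant proposed in this paper: the function name `frequent_directions`, the buffer size `2 * ell`, and the choice of shrinkage amount are assumptions taken from the standard description of Frequent Directions, and NumPy is assumed to be available.

```python
import numpy as np


def frequent_directions(rows, ell):
    """Sketch a stream of d-dimensional rows into a (2*ell, d) matrix B.

    Illustrative only: this is standard Frequent Directions, not the
    algorithm proposed in the paper. Whenever the buffer fills up, every
    squared singular value is reduced by the (ell+1)-th largest squared
    singular value, which zeroes out the bottom half of the buffer.
    """
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    B = np.zeros((2 * ell, d))
    next_free = 0  # index of the next empty row in the buffer
    for a in rows:
        if next_free == 2 * ell:
            # Buffer is full: compress it with an SVD plus shrinkage.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell] ** 2 if ell < len(s) else 0.0
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.zeros((2 * ell, d))
            B[: len(s)] = s_shrunk[:, None] * Vt
            next_free = ell  # rows ell, ell+1, ... are zero again
        B[next_free] = a
        next_free += 1
    return B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 10))  # hypothetical feature stream
    ell = 4
    B = frequent_directions(A, ell)
    err = np.linalg.norm(A.T @ A - B.T @ B, 2)
    bound = np.linalg.norm(A, "fro") ** 2 / ell
    print(f"spectral error {err:.3f} <= bound {bound:.3f}")
```

For this scheme the covariance error satisfies ‖AᵀA − BᵀB‖₂ ≤ ‖A‖_F² / ℓ for any input stream, which is the kind of worst-case guarantee that plain truncation (simply dropping the smallest singular values) does not provide.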
The corresponding author Weinan Zhang was supported by the “New Generation of AI 2030” Major Project (2018AAA0100900) and the National Natural Science Foundation of China (Grant Nos. 62076161, 61772333, 61632017).
Author information
Cheng Chen is currently a PhD candidate in the APEX Lab at Shanghai Jiao Tong University, China. He received his bachelor’s degree from the Department of Computer Science at Shanghai Jiao Tong University, China, in 2013. His research interests lie in matrix approximation, online learning, and optimization.
Weinan Zhang received his PhD degree from University College London in 2016 and his BS degree from the ACM Class of Shanghai Jiao Tong University, China, in 2011. He is currently an assistant professor with the Department of Computer Science, Shanghai Jiao Tong University. He has published over 50 research papers in conferences and journals, including KDD, SIGIR, AAAI, WWW, WSDM, ICDM, JMLR, and IPM. His research interests include machine learning and big data mining, particularly deep learning and reinforcement learning techniques for real-world data mining scenarios such as computational advertising, recommendation systems, text mining, Web search, and knowledge graphs.
Yong Yu received his MS degree from the CS Department, East China Normal University, China. He is currently a professor with the Department of Computer Science, Shanghai Jiao Tong University, China, and the Director of the Apex Data & Knowledge Management Lab. As principal investigator, he has led several National Natural Science Foundation of China and China National High Tech (863) Program projects. His research interests include Web search, semantic search, data mining, and machine learning. He has published over 200 papers and served as a PC member of several conferences, including WWW, RecSys, and a dozen other related conferences such as NIPS, ICML, SIGIR, and ISWC.
Cite this article
Chen, C., Zhang, W. & Yu, Y. Efficient policy evaluation by matrix sketching. Front. Comput. Sci. 16, 165330 (2022). https://doi.org/10.1007/s11704-021-0354-4