Abstract
Evolutionary algorithms (EAs) have proven to be a promising approach to parameter optimization in deep reinforcement learning (RL) in recent years. However, they still suffer from the curse of dimensionality when dealing with high-dimensional inputs. Based on experiments, we observe that only a few variables contribute significantly to the performance of a large-scale RL policy. Motivated by this observation, we propose a parallel random embedding framework that optimizes strategies over multiple parameter subspaces, allowing classical evolutionary algorithms and techniques to be applied to million-scale RL policy optimization. Experiments show that our approach achieves superior performance when Negatively Correlated Search (NCS) is instantiated in the framework.
This work is supported by the Natural Science Foundation of China (Grant No. 61806090 and Grant No. 61672478), Guangdong Provincial Key Laboratory (Grant No. 2020B121201001), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386), Shenzhen Science and Technology Program (Grant No. KQTD2016112514355531), the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008).
References
Al-Dujaili, A., Suresh, S.: Embedded bandits for large-scale black-box optimization. In: Singh, S.P., Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA. pp. 758–764. AAAI Press, New York (2017)
Binois, M., Ginsbourger, D., Roustant, O.: On the choice of the low-dimensional domain for global optimization via random embeddings. J. Global Optim. 76(1), 69–90 (2019). https://doi.org/10.1007/s10898-019-00839-1
Carpentier, A., Munos, R.: Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. Proc. Mach. Learn. Res. 22, 190–198 (2012)
Chrabaszcz, P., Loshchilov, I., Hutter, F.: Back to basics: Benchmarking canonical evolution strategies for playing Atari. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1419–1426 (2018)
Conti, E., Madhavan, V., Such, F.P., Lehman, J., Stanley, K.O., Clune, J.: Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In: Advances in Neural Information Processing Systems 31: NeurIPS 2018, December 3–8, 2018, Montreal, Canada. pp. 5032–5043 (2018)
Kaban, A., Bootkrajang, J., Durrant, R.J.: Towards large scale continuous EDA: a random matrix theory perspective. In: Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation, GECCO 2013, p. 383. ACM Press, New York (2013)
Kakade, S.M.: A natural policy gradient. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3–8, 2001, Vancouver, British Columbia, Canada], pp. 1531–1538. MIT Press, Cambridge, MA (2001)
Knight, J.N., Lunacek, M.: Reducing the space-time complexity of the CMA-ES. In: Lipson, H. (ed.) Genetic and Evolutionary Computation Conference, GECCO 2007, Proceedings, London, England, UK, July 7–11, 2007, pp. 658–665. ACM Press, New York (2007)
Loshchilov, I.: A computationally efficient limited memory CMA-ES for large scale optimization. In: Arnold, D.V. (ed.) Genetic and Evolutionary Computation Conference, pp. 397–404. ACM Press, New York (2014)
Ma, X., et al.: A survey on cooperative co-evolutionary algorithms. IEEE Trans. Evol. Comput. 23(3), 421–441 (2019)
Machado, M.C., Bellemare, M.G., et al.: Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61(1), 523–562 (2018)
Müller, N., Glasmachers, T.: Challenges in high-dimensional reinforcement learning with evolution strategies. In: Auger, A., Fonseca, C.M., Lourenço, N., Machado, P., Paquete, L., Whitley, D. (eds.) PPSN 2018. LNCS, vol. 11102, pp. 411–423. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99259-4_33
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Mnih, V., Badia, A.P., et al.: Asynchronous methods for deep reinforcement learning. Proc. Mach. Learn. Res. 48, 1928–1937 (2016)
Mousavi, S.S., Schukat, M., Howley, E.: Deep reinforcement learning: an overview. In: Bi, Y., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2016. LNNS, vol. 16, pp. 426–440. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-56991-8_32
Qian, H., Hu, Y.Q., Yu, Y.: Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, pp. 1946–1952. AAAI Press, New York (2016)
Qian, H., Yu, Y.: Scaling simultaneous optimistic optimization for high-dimensional non-convex functions with low effective dimensions. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 2000–2006. AAAI Press, New York (2016)
Qian, H., Yu, Y.: Solving high-dimensional multi-objective optimization problems with low effective dimensions. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI 2017, pp. 875–881. AAAI Press, New York (2017)
Salimans, T., Ho, J., et al.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864 (2017)
Sanyang, M.L., Kabán, A.: REMEDA: random embedding EDA for optimising functions with intrinsic dimension. In: Handl, J., Hart, E., Lewis, P.R., López-Ibáñez, M., Ochoa, G., Paechter, B. (eds.) PPSN 2016. LNCS, vol. 9921, pp. 859–868. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45823-6_80
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
Such, F.P., Madhavan, V., et al.: Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567 (2018)
Tang, K., Yang, P., Yao, X.: Negatively correlated search. IEEE J. Sel. Areas Commun. 34(3), 542–550 (2016)
Wang, Z., Zoghi, M., et al.: Bayesian optimization in high dimensions via random embeddings. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI 2013, pp. 1778–1784. AAAI Press (2013)
Yang, P., Tang, K., Yao, X.: A parallel divide-and-conquer-based evolutionary algorithm for large-scale optimization. IEEE Access 7, 163105–163118 (2019)
Yang, Z., Tang, K., Yao, X.: Large scale evolutionary optimization using cooperative coevolution. Inf. Sci. 178(15), 2985–2999 (2008)
Zhang, L., Mahdavi, M., Jin, R., Yang, T., Zhu, S.: Recovering the optimal solution by dual random projection. In: Proceedings of the 26th Annual Conference on Learning Theory, vol. 30, pp. 135–157 (2013)
Appendix
Hyper-parameters. In NCSRE, the number of populations is set to 7 and the population size of each sub-process is set to 2. For a network with 1.7 million parameters, we set the embedding dimension to 100, balancing the computational resource limitation against the optimization effect. Generally speaking, a 1000-dimensional effective space can achieve better performance if matrix-multiplication acceleration techniques are adopted. For NCS-C, the learning rate of sigma and the adaptive negative-correlation coefficient are set as in the original paper, and the initial value of sigma is determined by grid search and set to 0.01 when the bound is [−0.1, 0.1]. The bound of the parameters defines our search space and should be carefully confirmed for each RL problem. We determined the bound of the D-dimensional space according to existing research; most parameters of well-performing networks lie in [−10, 10]. In our experiments, alpha is set to a constant for convenience. Notably, the phase length and the epoch length correspond to each other so that the sub-processes match the random embedding framework. In NCS-C, each sigma is adapted every epoch iterations, and epoch is usually set to 5 or 10. As for \(L_{phase}\), there is a trade-off: too long a phase leaves fewer embedding subspaces to be explored, while too short a phase leads to insufficient optimization within each subspace. Considering these factors, we set \(L_{phase}\) to 30 and epoch to 5. Moreover, each network is re-evaluated t times to reduce the noise in the games, and the average score is taken as the fitness. The initial value of t is set to 3 and is adaptively increased to 6 according to \(3 \cdot \frac{t}{t_{max}}\) (Table 2).
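To make the embedding step concrete, the following is a minimal sketch of how a low-dimensional search point could be mapped to the full parameter vector under the settings above, assuming a Gaussian embedding matrix; the names `embed`, `evaluate_avg`, and `x_anchor` are illustrative and are not the identifiers used in the released code.

```python
import numpy as np

# Dimensions: the paper uses D = 1.7 million and d = 100; D is reduced
# here only so the sketch runs cheaply.
D, d = 10_000, 100

rng = np.random.default_rng(0)
# Gaussian random embedding matrix A in R^{D x d}; in NCSRE each
# sub-process works in its own randomly drawn subspace.
A = rng.standard_normal((D, d))

def embed(y, x_anchor):
    """Map a low-dimensional point y (searched within [-0.1, 0.1]^d)
    to the full parameter vector, clipped to the bound [-10, 10]."""
    return np.clip(x_anchor + A @ y, -10.0, 10.0)

def evaluate_avg(rollout, y, x_anchor, t=3):
    """Fitness of y: the average of t noisy game evaluations."""
    x = embed(y, x_anchor)
    return float(np.mean([rollout(x) for _ in range(t)]))

# Usage with a stand-in rollout function:
x0 = np.zeros(D)
y0 = rng.uniform(-0.1, 0.1, d)
score = evaluate_avg(lambda x: -np.sum(x ** 2), y0, x0)
```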
Implementation. The implementation code is available on GitHub at https://github.com/Desein-Yang/NCS-RL.
Network Architecture. The parameters of the policy network architecture used in our experiments are listed in Table 3.
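To make the phase/epoch schedule described under Hyper-parameters concrete, below is a simplified sketch of the control flow, assuming the values quoted there (7 populations, \(L_{phase} = 30\), epoch = 5); all helper names are illustrative placeholders rather than the API of the released NCS-RL code.

```python
N_POP, L_PHASE, EPOCH = 7, 30, 5

def redraw_embedding(pop):   # start a phase in a fresh random subspace
    pass                     # placeholder: sample a new embedding matrix

def ncs_c_step(pop):         # one NCS-C iteration inside the subspace
    pass                     # placeholder: mutate, evaluate, select

def adapt_sigma(pop):        # NCS-C step-size adaptation
    pass

def run_phase(pop):
    redraw_embedding(pop)
    for it in range(1, L_PHASE + 1):
        ncs_c_step(pop)
        if it % EPOCH == 0:  # sigma is adapted every `epoch` iterations
            adapt_sigma(pop)

# The N_POP sub-processes run their phases in parallel and exchange
# their best solutions before the next phase begins.
for pop in range(N_POP):
    run_phase(pop)
```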