Abstract
As an extension of cyber–physical systems (CPSs), cyber–physical–social systems (CPSSs) seamlessly integrate cyber space, physical space, and social space. CPSSs provide a more comprehensive perceptual smart space for reinforcement learning (RL), promising a revolution in artificial intelligence (AI) that urgently requires innovation in computing system architecture. This paper provides a comprehensive review of, and perspectives on, RL architectures for collaborative computing systems in CPSSs. First, we analyze the features of CPSS AI from a data perspective, including multi-modal data fusion across multiple spaces, rule discovery and representation, collaborative computing architectures for multiple spaces, and intelligent decision-making (policy) discovery. We then propose the action-aware transition tensor to fuse CPSS data for collaborative RL, and survey the typical architectures and methods of RL. Furthermore, a tensor-based unified collaborative reinforcement computing architecture is proposed for CPSS AI, covering both the architecture and its optimal policy solution. Finally, we summarize the paper and discuss future work.
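To make the "transition tensor" idea concrete, the following is a minimal sketch of a standard MDP transition tensor P[s, a, s'] driving tabular value iteration. The sizes and random data are hypothetical; the paper's action-aware transition tensor would additionally index fused multi-space (cyber, physical, social) features, which this plain-MDP sketch does not attempt to model.

```python
import numpy as np

# Hypothetical sizes: S states, A actions. The paper's CPSS tensor would
# carry extra modes for multi-space features; here we keep a plain MDP.
S, A = 4, 2
rng = np.random.default_rng(0)

# Action-aware transition tensor: P[s, a, s'] = Pr(s' | s, a).
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize each (s, a) slice over s'

R = rng.random((S, A))  # reward for taking action a in state s
gamma = 0.9

# Value iteration written directly on the tensor:
#   Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
V = np.zeros(S)
for _ in range(500):
    Q = R + gamma * np.einsum('kas,s->ka', P, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy policy recovered from the converged Q
```

The appeal of the tensor form is that the Bellman backup collapses into a single contraction (`einsum`), which is the kind of operation a collaborative computing architecture can distribute or decompose.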
Acknowledgements
This work was supported by National Natural Science Foundation Project under Grants 62166047, 62101481, 62002313 and 62261060.
About this article
Cite this article
Li, X., Wang, P., Jin, X. et al. Reinforcement learning architecture for cyber–physical–social AI: state-of-the-art and perspectives. Artif Intell Rev 56, 12655–12688 (2023). https://doi.org/10.1007/s10462-023-10450-2