ABSTRACT
Scientific computing systems are becoming increasingly complex and are approaching the limit of what current human-in-the-loop techniques can manage. To address this problem, autonomic, goal-driven management actions based on machine learning must be applied end to end across the scientific computing landscape. Although researchers proposed architectures and design choices for autonomic computing systems more than a decade ago, practical realization of such systems has been limited, especially in scientific computing environments. Growing interest and recent developments in machine learning have spurred proposals to apply machine learning to goal-based optimization of computing systems in an autonomous fashion. We review recent work that uses machine learning algorithms to improve computer system performance and identify gaps and open issues. We then propose a hierarchical architecture that builds on the earlier proposals for autonomic computing systems to realize an autonomous science infrastructure.
Index Terms
- Towards Autonomic Science Infrastructure: Architecture, Limitations, and Open Issues