DOI: 10.1145/3217197.3217205
research-article

Towards Autonomic Science Infrastructure: Architecture, Limitations, and Open Issues

Published: 11 June 2018

ABSTRACT

Scientific computing systems are becoming increasingly complex and are close to reaching a critical limit in manageability under current human-in-the-loop techniques. To address this problem, autonomic, goal-driven management actions based on machine learning must be applied end to end across the scientific computing landscape. Although researchers proposed architectures and design choices for autonomic computing systems more than a decade ago, practical realization of such systems has been limited, especially in scientific computing environments. Growing interest and recent developments in machine learning have spurred proposals to apply machine learning for goal-based optimization of computing systems in an autonomous fashion. We review recent work that uses machine learning algorithms to improve computer system performance, and we identify gaps and open issues. We propose a hierarchical architecture that builds on the earlier proposals for autonomic computing systems to realize an autonomous science infrastructure.

Published in

AI-Science'18: Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science
June 2018, 53 pages
ISBN: 9781450358620
DOI: 10.1145/3217197

Copyright © 2018 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States
