DOI: 10.1145/3526063.3535350
research-article

Server-Side Workload Identification for HPC I/O Requests

Published: 27 June 2022

ABSTRACT

In this paper, we develop a method to identify High Performance Computing (HPC) workloads from a stream of incoming I/O requests. This workload characterization could then be used to intelligently schedule I/O requests in the parallel file system (PFS) that most HPC systems use. For this purpose, we use a deep learning model designed to pick up changes in the workload as they occur. We show that our method accurately determines the workload characteristics when evaluated on publicly available server-side HPC traces. We also show that I/O scheduling based on such a characterization can substantially increase the available I/O bandwidth and thus reduce latencies for HPC workloads.

Supplemental Material

PERMAVOST22-perma02.mp4 (mp4, 180.2 MB)

      • Published in
        PERMAVOST '22: Proceedings of the 2nd Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy
        June 2022
        30 pages
        ISBN:9781450393140
        DOI:10.1145/3526063

        Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States
