
Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling

The Journal of Supercomputing

Abstract

As the number of nodes in high-performance computing (HPC) systems increases, collective I/O becomes increasingly important, and I/O aggregators are a key factor in its performance. When an HPC system uses non-exclusive scheduling, a different number of CPU cores per node can be assigned to an MPI job; as a result, the I/O aggregators experience disparities in workload and communication cost. Because communication behavior is influenced by the sequence of the I/O aggregators and by the number of CPU cores in neighboring nodes, changing the order of the nodes affects the communication costs of collective I/O. Few studies, however, have addressed how to determine an appropriate node sequence. In this study, we found that an inappropriate node order increases the communication costs of collective I/O. To address this problem, we propose heuristic methods that regulate the node sequence. We also develop a prediction function to estimate MPI-IO performance under the proposed heuristics. Performance measurements indicate that the proposed scheme prevents the degradation of collective I/O performance. For instance, on a multi-core cluster system with the Lustre file system, the read bandwidth of MPI-Tile-IO was improved by 7.61% to 17.21%, and its write bandwidth by 17.05% to 26.49%.
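For readers less familiar with the interface involved, the sketch below shows a generic MPI-IO collective write in C, together with the standard ROMIO hint "cb_nodes" that influences how many processes act as I/O aggregators. It is only a minimal illustration of where aggregators enter the collective I/O path; it is not the node-ordering heuristic or the prediction function proposed in this paper, and the file name, buffer size, and hint value are assumptions made for the example.

/* Minimal sketch of an MPI-IO collective write.  NOT the paper's method:
 * the file name, element count, and the "cb_nodes" value are assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;                 /* elements per process (assumed) */
    int *buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++)
        buf[i] = rank;

    /* ROMIO hint controlling how many processes serve as I/O aggregators
     * in the two-phase collective I/O protocol. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "4");    /* assumed value */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each process writes a contiguous block at its own offset; the
     * collective call lets the aggregators merge and reorder the requests. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}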



Author information


Correspondence to Kwangho Cha.


Cite this article

Cha, K., Maeng, S. Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling. J Supercomput 61, 966–996 (2012). https://doi.org/10.1007/s11227-011-0669-2
