As the number of nodes in high-performance computing (HPC) systems increases, collective I/O becomes increasingly important, and I/O aggregators are key to improving its performance. When an HPC system uses non-exclusive scheduling, MPI jobs can be assigned a different number of CPU cores on each node; as a result, I/O aggregators experience disparities in their workloads and communication costs. Because communication behavior is influenced by the sequence of the I/O aggregators and by the number of CPU cores in neighboring nodes, changing the order of the nodes affects the communication costs of collective I/O. Few studies, however, have addressed how to determine the node sequence adequately. In this study, we found that an inappropriate node order increases the communication costs of collective I/O. To address this problem, we propose heuristic methods that regulate the node sequence. We also develop a prediction function that estimates MPI-IO performance when the proposed heuristics are used. Performance measurements indicate that the proposed scheme achieves its goal of preventing performance degradation in collective I/O. For instance, on a multi-core cluster system with the Lustre file system, the read bandwidth of MPI-Tile-IO improved by 7.61% to 17.21%, and the write bandwidth of the same benchmark increased by 17.05% to 26.49%.
Cha, K., Maeng, S. Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling. J Supercomput 61, 966–996 (2012). https://doi.org/10.1007/s11227-011-0669-2