Abstract
The advent of multi-core architectures provides an opportunity for accelerating parallelism in mesh-based applications. This multi-core environment, however, imposes challenges not addressed by conventional graph-partitioning techniques, which were originally designed for distributed-memory uniprocessors. As a first step toward exploiting the multi-core platform, this paper presents an experimental evaluation of partitioning performance on small-scale heterogeneous multi-core clusters. Based on the results and analyses gathered, we propose a hierarchical framework for resource-aware graph partitioning on heterogeneous multi-core clusters. A preliminary evaluation demonstrates the potential of the framework and motivates directions for incorporating application requirements into graph partitioning.
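To make the idea of hierarchical, resource-aware partitioning concrete, the following is a minimal sketch, not the framework proposed in this paper. It assumes that each core's relative speed is already known (for example, from benchmarking) and that the cluster is modelled as a cluster → machine → core tree. The workload is split top-down, with each level dividing its share among its children in proportion to their aggregate capacity. The resulting per-core fractions are then rounded to integer vertex counts that sum exactly to the mesh size. The names `Resource`, `target_fractions` and `vertex_targets`, as well as the speed values, are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """Hypothetical resource-tree node: cluster -> machine -> core."""
    name: str
    speed: float = 0.0  # relative speed of a leaf (core); assumed known, e.g. from benchmarks
    children: List["Resource"] = field(default_factory=list)

    def capacity(self) -> float:
        # A leaf contributes its own speed; an inner node aggregates its children.
        if not self.children:
            return self.speed
        return sum(child.capacity() for child in self.children)


def target_fractions(root: Resource) -> Dict[str, float]:
    """Split the whole workload (1.0) top-down, level by level, in proportion
    to each child's aggregate capacity. Assumes all speeds are positive."""
    shares: Dict[str, float] = {}

    def split(node: Resource, share: float) -> None:
        if not node.children:
            shares[node.name] = share
            return
        total = node.capacity()
        for child in node.children:
            split(child, share * child.capacity() / total)

    split(root, 1.0)
    return shares


def vertex_targets(shares: Dict[str, float], n_vertices: int) -> Dict[str, int]:
    """Round fractional shares to integer vertex counts that sum to n_vertices,
    using the largest-remainder method."""
    raw = {name: frac * n_vertices for name, frac in shares.items()}
    counts = {name: math.floor(value) for name, value in raw.items()}
    leftover = n_vertices - sum(counts.values())
    # Give the remaining vertices to the parts with the largest fractional remainders.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical cluster: machine A has 2 fast cores, machine B has 4 slower cores.
    cluster = Resource("cluster", children=[
        Resource("A", children=[Resource("A0", 2.0), Resource("A1", 2.0)]),
        Resource("B", children=[Resource(f"B{i}", 1.0) for i in range(4)]),
    ])
    shares = target_fractions(cluster)
    print(shares)
    print(vertex_targets(shares, 1000))
```

In this example, machines A and B have equal aggregate capacity (4.0 each), so each receives half of the mesh. Each A core then gets a quarter and each B core an eighth. Target weights of this kind could be handed to any partitioner that accepts per-part target weights. Splitting hierarchically, rather than over a flat list of cores, leaves room to treat inter-machine and intra-machine communication differently at each level.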
Notes
While the OS and other software versions were the same across the testbed, we compiled numerical libraries such as ATLAS 3.9.14 and CLAPACK 3.1.1.1 separately on each microprocessor to obtain the best possible optimization for it.
Certain libraries on which the HPCC benchmarks rely were compiled separately for each machine architecture, and the resulting binaries were placed in different folders on the frontend.
Poisson and SynApp are available on request to aubanel@unb.ca.
156,061 mesh vertices and 467,315 edges.
For brevity, the partitions for Top_v1 and Top_v2 will henceforth be named after the underlying topology.
We found that JOSTLE consistently created disconnected partitions for a four-level processor graph with P=20. We therefore opted for three levels to minimize the likelihood of producing disconnected partitions.
We have submitted the respective graphs to DIMACS10 [38] for public use.
References
Alam, S.R., Agarwal, P.K., Hampton, S.S., Ong, H.: Experimental evaluation of molecular dynamics simulations on multi-core systems. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V. (eds.) High Performance Computing—HiPC 2008. Lecture Notes in Computer Science, vol. 5374, pp. 131–141. Springer, Berlin (2008). doi:10.1007/978-3-540-89894-8_15
Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Tech. Rep., EECS Department, University of California, Berkeley (2006)
Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10), 56–67 (2009). doi:10.1145/1562764.1562783
Aubanel, E.: Resource-aware load balancing of parallel applications. In: Udoh, E., Wang, F.Z. (eds.) Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, pp. 12–21. IGI Global, Hershey (2009)
Barnard, S.T., Simon, H.D.: Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurrency 6(2), 101–117 (1994). doi:10.1002/cpe.433006020
Bhatelé, A., Kalé, L.V.: Quantifying network contention on large parallel machines. Parallel Process. Lett. 19(4), 553–572 (2009). doi:10.1142/S0129626409000419 (Special Issue on Large-Scale Parallel Processing)
Borkar, S.Y., Dubey, P., Kahn, K.C., Kuck, D.J., Mulder, H., Pawlowski, S.S., Rattner, J.R.: Platform 2015: Intel processor and platform evolution for the next decade. Tech. Rep. White Paper, Intel Corporation (2005). ftp://download.intel.com/technology/computing/archinnov/platform2015/download/Platform_2015.pdf
Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., Namyst, R.: hwloc: a generic framework for managing hardware affinities in HPC applications. In: Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010), Pisa, Italy, pp. 180–186 (2010)
Bui, T., Jones, C.: A heuristic for reducing fill-in in sparse matrix factorization. In: Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, pp. 445–452. SIAM, Philadelphia (1993)
Canon, L.C., Dubuisson, O., Gustedt, J., Jeannot, E.: Defining and controlling the heterogeneity of a cluster: the Wrekavoc tool. J. Syst. Softw. 83(5), 786–802 (2010). doi:10.1016/j.jss.2009.11.734
Chai, L., Gao, Q., Panda, D.: Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. In: Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2007), Rio de Janeiro, pp. 471–478. IEEE, New York (2007). doi:10.1109/CCGRID.2007.119
Chan, S.Y., Ling, T.C., Aubanel, E.: Benchmarking and profiling heterogeneous multi-core clusters using graph-partitioning workload. Tech. Rep. CS-2011-01, Faculty of Computer Science and Information Technology, University of Malaya (2011)
Chartrand, G., Zhang, P.: Introduction to Graph Theory. Walter Rudin Series in Advanced Mathematics. McGraw-Hill Higher Education, Singapore (2005)
Chen, J., Taylor, V.E.: Mesh partitioning for efficient use of distributed systems. IEEE Trans. Parallel Distrib. Syst. 13(1), 67–79 (2002). doi:10.1109/71.980027
Chen, H., Chen, W., Huang, J., Robert, B., Kuhn, H.: MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In: Proceedings of the 20th Annual International Conference on Supercomputing, Cairns, Queensland, Australia, pp. 353–360 (2006). doi:10.1145/1183401.1183451
Clout, B., Aubanel, E.: EHGrid: an emulator of heterogeneous computational grids. In: IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), Rome, pp. 1–8 (2009). doi:10.1109/IPDPS.2009.5161167
Cybenko, G.: Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput. 7(2), 279–301 (1989). doi:10.1016/0743-7315(89)90021-X
Devine, K., Boman, E., Heaphy, R., Hendrickson, B., Vaughan, C.: Zoltan data management services for parallel dynamic applications. Comput. Sci. Eng. 7(2), 90–97 (2002)
Dümmler, J., Rauber, T., Rünger, G.: Mapping algorithms for multiprocessor tasks on multi-core clusters. In: Proc. of the 37th International Conference on Parallel Processing (ICPP 2008), Portland, Oregon, USA, pp. 141–148. IEEE Computer Society, Los Alamitos (2008). doi:10.1109/ICPP.2008.42
Faik, J., Flaherty, J.E., Gervasio, L.G., Teresco, J.D., Devine, K.D.: A model for resource aware load balancing on heterogeneous clusters. Tech. Rep. CS-05-01, Williams College Department of Computer Science (2005)
Gropp, W., Gunter, D., Taylor, V.: FPMPI-2: Fast profiling library for MPI (2001). http://www.mcs.anl.gov/research/projects/fpmpi/WWW/
Hendrickson, B.: Load balancing fictions, falsehoods and fallacies. Appl. Math. Model. 25, 99–108 (2000)
Hendrickson, B., Kolda, T.G.: Graph partitioning models for parallel computing. Parallel Comput. 26(12), 1519–1534 (2000). doi:10.1016/S0167-8191(00)00042-9
Hendrickson, B., Leland, R.: A multilevel algorithm for partitioning graphs. In: Karin, S. (ed.) Proceedings of the ACM/IEEE conference on Supercomputing, p. 28. ACM, New York (1995). doi:10.1145/224170.224228
Hood, R., Jin, H., Mehrotra, P., Chang, J., Djomehri, J., Gavali, S., Jespersen, D., Taylor, K., Biswas, R.: Performance impact of resource contention in multicore systems. In: IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Atlanta, GA, pp. 1–12. IEEE, New York (2010). doi:10.1109/IPDPS.2010.5470399
Huang, S., Aubanel, E., Bhavsar, V.: Pagrid: A mesh partitioner for computational grids. J. Grid Comput. 4(1), 71–88 (2006)
Jeannot, E., Mercier, G.: Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: D’Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010—Parallel Processing. Lecture Notes in Computer Science, vol. 6272, pp. 199–210. Springer, Berlin (2010). doi:10.1007/978-3-642-15291-7_20
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998). doi:10.1137/S1064827595287997
Karypis, G., Kumar, V.: METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0 (1998). http://glaros.dtc.umn.edu/gkhome/views/metis
Kayi, A., El-Ghazawi, T., Newby, G.B.: Performance issues in emerging homogeneous multi-core architectures. Simul. Model. Pract. Theory 17(9), 1485–1499 (2009). doi:10.1016/j.simpat.2009.06.014
Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49(1), 291–307 (1970)
Koenig, G.A., Kalé, L.V.: Optimizing distributed application performance using dynamic grid topology-aware load balancing. In: International Parallel and Distributed Processing Symposium, Long Beach, CA, pp. 1–10 (2007)
Korkhov, V.V., Krzhizhanovskaya, V.V., Sloot, P.: A grid-based virtual reactor: parallel performance and adaptive load balancing. J. Parallel Distrib. Comput. 68(5), 596–608 (2008). doi:10.1016/j.jpdc.2007.08.010
Kurc, O., Will, K.: An iterative parallel workload balancing framework for direct condensation of substructures. Comput. Methods Appl. Mech. Eng. 196, 2084–2096 (2007). doi:10.1016/j.cma.2006.07.015
Leng, T., Ali, R., Hsieh, J., Mashayekhi, V., Rooholamini, R.: Performance impact of process mapping on small-scale SMP clusters—a case study using High Performance Linpack. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02), pp. 236–243. IEEE, New York (2002). doi:10.1109/IPDPS.2002.1016657
Levon, J.: OProfile—a system profiler for Linux (2004). http://oprofile.sourceforge.net/
Luszczek, P., Dongarra, J.J., Koester, D., Rabenseifner, R., Lucas, B., Kepner, J., McCalpin, J., Bailey, D., Takahashi, D.: Introduction to the HPC Challenge benchmark suite. Tech. Rep. Paper LBNL-57493, Lawrence Berkeley National Laboratory (2005)
Meyerhenke, H.: 10th DIMACS Implementation Challenge—graph partitioning and graph clustering (2011). http://www.cc.gatech.edu/dimacs10/
Meyerhenke, H., Monien, B., Schamberger, S.: Graph partitioning and disturbed diffusion. Parallel Comput. 35(10–11), 544–569 (2009). doi:10.1016/j.parco.2009.09.006
Moulitsas, I., Karypis, G.: Architecture aware partitioning algorithms. In: Algorithms and Architectures for Parallel Processing. Lecture Notes in Computer Science, vol. 5022, pp. 42–53. Springer, Berlin (2008). doi:10.1007/978-3-540-69501-1
Peng, L., Peir, J.K., Prakash, T.K., Staelin, C., Chen, Y.K., Koppelman, D.: Memory hierarchy performance measurement of commercial dual-core desktop processors. J. Syst. Archit. 54(8), 816–828 (2008). doi:10.1016/j.sysarc.2008.02.004
Schloegel, K., Karypis, G., Kumar, V.: Graph partitioning for high-performance scientific simulations. In: Sourcebook of Parallel Computing, pp. 491–541. Morgan Kaufmann, San Francisco (2003)
Shewchuk, J.R.: Triangle: A two-dimensional quality mesh generator and Delaunay triangulator (2007). http://www.cs.cmu.edu/~quake/triangle.html
Sinha, S., Parashar, M.: Adaptive system sensitive partitioning of AMR applications on heterogeneous clusters. Clust. Comput. 5(4), 343–352 (2002)
Teresco, J.D., Faik, J., Flaherty, J.E.: Hierarchical partitioning and dynamic load balancing for scientific computation. In: Dongarra, J., Madsen, K., Wasniewski, J. (eds.) Applied Parallel Computing. Lecture Notes in Computer Science, vol. 3732, pp. 911–920. Springer, Berlin (2006). doi:10.1007/11558958
University of Paderborn: Graph partitioning—graph collection (2011). http://www2.cs.uni-paderborn.de/cs/ag-monien/RESEARCH/PART/GRAPHS/FEM2.tar
Walshaw, C.: The graph partitioning archive (2008). http://staffweb.cms.gre.ac.uk/~c.walshaw/partition/
Walshaw, C., Cross, M.: Multilevel mesh partitioning for heterogeneous communication networks. Future Gener. Comput. Syst. 17(5), 601–623 (2001). doi:10.1016/S0167-739X(00)00107-2
Walshaw, C., Cross, M.: JOSTLE: parallel multilevel graph-partitioning software—an overview. In: Magoules, F. (ed.) Mesh Partitioning Techniques and Domain Decomposition Techniques, pp. 27–58. Civil-Comp, Stirling (2007)
Walshaw, C., Cross, M., Diekmann, R., Preis, R.: Multilevel mesh partitioning for optimizing domain shape. Int. J. High Perform. Comput. Appl. 13(4), 334–353 (1999). doi:10.1177/109434209901300404
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). doi:10.1145/1498765.1498785
Acknowledgements
This work was supported by Postgraduate Research Fund (PS413/2010B) and Fellowship Scheme, University of Malaya, Malaysia. Initial background research used ACEnet, the regional high performance computing consortium for universities in Atlantic Canada. The authors would like to thank Abhinav Bhatelé, Zoltán Majó, and Basile Clout for answering our queries, Jian Tao Zhang for creating test instances, and anonymous reviewers for helpful comments.
Cite this article
Chan, S.Y., Ling, T.C. & Aubanel, E. The impact of heterogeneous multi-core clusters on graph partitioning: an empirical study. Cluster Comput 15, 281–302 (2012). https://doi.org/10.1007/s10586-012-0229-4
DOI: https://doi.org/10.1007/s10586-012-0229-4