Abstract
This paper examines four strategies for implementing the parallel conjugate gradient (CG) method, each with its own data distribution, and analyzes how they affect communication and overall performance. First, typical 1D and 2D distributions of the matrix involved in the CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload is proposed, based on leaving some threads idle during part of the computation in order to reduce communication. All four strategies are independent of the sparse storage scheme and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms using a set of matrices with distinct sparsity patterns, showing that the asymmetric proposal outperforms the other three in all cases except for one matrix on one platform.

Notes
For practical purposes, the algorithm is often used with preconditioners, but preconditioning is outside the scope of this paper.
Acknowledgments
This work was funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P), by the Galician Government (Consolidation Program of Competitive Reference Groups GRC2013/055) and by the U.S. Department of Energy (Contract No. DE-AC03-76SF00098).
Cite this article
González-Domínguez, J., Marques, O.A., Martín, M.J. et al. A 2D algorithm with asymmetric workload for the UPC conjugate gradient method. J Supercomput 70, 816–829 (2014). https://doi.org/10.1007/s11227-014-1300-0