Abstract
Massively parallel computers are evolving towards deeper levels of parallelism and memory hierarchy. At the same time, the ratio of the time required to transfer data, either through the memory hierarchy or between compute units, to the time required to perform floating-point operations is increasing exponentially. As a result, algorithms face two challenges: they must exploit multiple levels of parallelism, and they must reduce communication both between the compute units at each level of the hierarchy of parallelism and between the different levels of the memory hierarchy.
In this paper we present an algorithm for the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. The algorithm minimizes both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors, based on MPI and Pthreads, and show that this implementation performs better than the LU factorization routines in well-known numerical libraries. For tall and skinny matrices, that is, matrices with many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a factor of 4.5 on a cluster of 32 nodes, each node having two quad-core Intel Xeon EMT64 processors.
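The abstract does not spell out how the pivots are chosen, so the following is only an illustrative sketch, not the authors' implementation. It assumes that the panel factorization relies on tournament pivoting, the pivot selection scheme of the CALU algorithm: each block of a tall and skinny panel proposes b candidate pivot rows by local partial pivoting, and the candidates are merged pairwise along a reduction tree, so that only b rows per merge would need to be exchanged. The function names `select_pivots` and `tournament_pivoting` are hypothetical, the blocks stand in for processes or threads, and no MPI or Pthreads calls are made.

```python
def select_pivots(block, b):
    """Run partial pivoting on a stack of (row_index, row) pairs and
    return the b pairs chosen as pivots, with their original values."""
    work = [(idx, list(vals)) for idx, vals in block]   # scratch copies
    originals = {idx: vals for idx, vals in block}
    chosen = []
    for k in range(min(b, len(work))):
        # Pick the remaining row with the largest magnitude in column k.
        p = max(range(k, len(work)), key=lambda i: abs(work[i][1][k]))
        work[k], work[p] = work[p], work[k]
        piv_idx, piv = work[k]
        chosen.append((piv_idx, originals[piv_idx]))
        if piv[k] == 0:
            continue  # column is zero below the diagonal, nothing to eliminate
        for i in range(k + 1, len(work)):
            factor = work[i][1][k] / piv[k]
            for j in range(k, len(piv)):
                work[i][1][j] -= factor * piv[j]
    return chosen


def tournament_pivoting(panel, num_blocks):
    """Return the global indices of b pivot rows for an m x b panel,
    selected with a binary reduction tree over num_blocks row blocks."""
    m, b = len(panel), len(panel[0])
    size = -(-m // num_blocks)  # ceiling division
    # Leaves of the tree: each block selects its own b candidates.
    candidates = [
        select_pivots([(i, panel[i]) for i in range(s, min(s + size, m))], b)
        for s in range(0, m, size)
    ]
    # Inner nodes: merge candidate sets pairwise until one set remains.
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            group = list(candidates[i])
            if i + 1 < len(candidates):
                group += candidates[i + 1]
            merged.append(select_pivots(group, b))
        candidates = merged
    return [idx for idx, _ in candidates[0]]


if __name__ == "__main__":
    # Hypothetical 8 x 2 panel split over 4 blocks of 2 rows each.
    panel = [[1, 2], [3, 1], [0, 5], [2, 2], [9, 1], [1, 7], [4, 4], [2, 0]]
    print(tournament_pivoting(panel, 4))
```

In a two-level setting one could imagine the same reduction being applied first among the cores of a node and then among nodes, but that arrangement is an assumption about the general idea rather than a description taken from the paper.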
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Donfack, S., Grigori, L., Khabou, A. (2012). Avoiding Communication through a Multilevel LU Factorization. In: Kaklamanis, C., Papatheodorou, T., Spirakis, P.G. (eds) Euro-Par 2012 Parallel Processing. Euro-Par 2012. Lecture Notes in Computer Science, vol 7484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32820-6_55
DOI: https://doi.org/10.1007/978-3-642-32820-6_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32819-0
Online ISBN: 978-3-642-32820-6
eBook Packages: Computer Science, Computer Science (R0)