Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations

Ono, Kenji; Chiba, Shuichi; Inoue, Shunsuke; Minami, Kazuo

doi:10.1007/978-3-319-17353-5_17

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8969)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

Practical simulators require high-performance iterative methods and efficient treatment of boundary conditions, especially in the field of computational fluid dynamics. In this paper, we propose a novel bit-representation technique to enhance the performance of such simulators. The technique is applied to an iterative kernel implementation that treats various boundary conditions in a stencil computation on a structured grid system. Because the matrix elements are compressed into a bit representation, this approach reduces traffic from the main memory to the CPU and makes effective use of Single Instruction, Multiple Data (SIMD) units and the cache. The proposed implementation also replaces if-branch statements with mask operations on the bit expression, which promotes optimization of the code at both compile time and runtime. To evaluate the performance of the proposed implementation, we employ the Red–Black SOR and BiCGstab algorithms. Experimental results show that the proposed approach is up to 3.5 times faster than a naïve implementation on both the Intel and Fujitsu SPARC architectures.
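The idea of replacing if-branches with mask operations can be illustrated with a minimal sketch. It is not the authors' implementation: the bit layout (`ACTIVE_BIT`, `WEST_BIT`, `EAST_BIT`), the 1D Poisson-like stencil, and the function names are all assumptions made for illustration. Each cell carries a small integer whose bits say whether the cell is updated and whether each neighbour link is coupled or cut by a boundary. The masked sweep turns those bits into 0/1 coefficients, so the loop body contains no branches. The sketch assumes every active cell has at least one coupled neighbour.

```python
# Hypothetical bit layout for one cell (not taken from the paper).
ACTIVE_BIT, WEST_BIT, EAST_BIT = 0, 1, 2
ACTIVE, WEST, EAST = 1 << ACTIVE_BIT, 1 << WEST_BIT, 1 << EAST_BIT


def sweep_branchy(p, b, flags, omega, colour):
    """One Red-Black SOR half-sweep; boundaries handled with if-branches."""
    for i in range(2 - colour, len(p) - 1, 2):  # colour 0: even, 1: odd cells
        f = flags[i]
        if not (f & ACTIVE):
            continue                      # inactive cell: leave untouched
        s, diag = b[i], 0.0
        if f & WEST:                      # west link is coupled
            s += p[i - 1]
            diag += 1.0
        if f & EAST:                      # east link is coupled
            s += p[i + 1]
            diag += 1.0
        p[i] += omega * (s / diag - p[i])


def sweep_masked(p, b, flags, omega, colour):
    """Same half-sweep; the bits become 0/1 coefficients, no branches."""
    for i in range(2 - colour, len(p) - 1, 2):
        f = flags[i]
        a = float((f >> ACTIVE_BIT) & 1)  # 1.0 if the cell is updated
        w = float((f >> WEST_BIT) & 1)    # 1.0 if the west link is coupled
        e = float((f >> EAST_BIT) & 1)    # 1.0 if the east link is coupled
        diag = w + e + (1.0 - a)          # the (1 - a) term avoids 0/0 when inactive
        s = b[i] + w * p[i - 1] + e * p[i + 1]
        p[i] += a * omega * (s / diag - p[i])


def solve(sweep, n_iter=20, omega=1.2):
    """Run a few Red-Black iterations on a tiny, made-up 1D problem."""
    full = ACTIVE | WEST | EAST
    # Cell 4 is inactive; cells 3 and 5 have their links towards it cut.
    flags = [0, full, full, ACTIVE | WEST, 0, ACTIVE | EAST, full, 0]
    p = [0.0] * len(flags)
    b = [1.0] * len(flags)
    for _ in range(n_iter):
        sweep(p, b, flags, omega, 0)
        sweep(p, b, flags, omega, 1)
    return p
```

Calling `solve(sweep_branchy)` and `solve(sweep_masked)` gives the same iterates. The masked form trades a few extra multiplications for a straight-line loop body, which is the kind of code a compiler can vectorize more readily.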

Notes

1. Here, we count 1 flop for multiplication, addition, and subtraction operators, and 8 flops for the division operator.
2. The number of loads depends on the size of the cache line. In this case, there are 3 loads for array p owing to the L3 cache.
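The counting convention in note 1 can be written as a small helper, shown here as a sketch. The function names and the operation and load counts in the usage comment are hypothetical and are not figures from the paper. Only the weights (1 flop for add, subtract, and multiply; 8 flops for divide) come from the note.

```python
def weighted_flops(n_add=0, n_sub=0, n_mul=0, n_div=0, div_weight=8):
    """Flop count with 1 flop per add/sub/mul and div_weight flops per divide."""
    return n_add + n_sub + n_mul + div_weight * n_div


def bytes_per_flop(n_loads, n_stores, word_bytes, flops):
    """Memory traffic (loads plus stores, in bytes) divided by the flop count."""
    return (n_loads + n_stores) * word_bytes / flops


# Hypothetical kernel: 6 additions, 2 multiplications, 1 division per cell,
# with 4 loads and 1 store of 4-byte words per cell.
flops = weighted_flops(n_add=6, n_mul=2, n_div=1)
ratio = bytes_per_flop(n_loads=4, n_stores=1, word_bytes=4, flops=flops)
```

As note 2 points out, the effective number of loads depends on the cache behaviour, so `n_loads` should be the count that actually reaches main memory rather than the number of array references in the source code.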

Acknowledgments

We thank the RIKEN Advanced Institute for Computational Science for allowing us to use the K computer to obtain our results. Part of this research was supported by a grant for the “Strategic Program on HPCI Field No. 4: Industrial Innovations” from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) “Development and Use of Advanced, High-Performance, General-Purpose Supercomputers Project,” and was carried out in partnership with the University of Tokyo.

Author information

Authors and Affiliations

RIKEN, Advanced Institute for Computational Science, 7-1-26, Minatojima-minami-machi, Chuo-ku, Kobe, 650-0047, Japan
Kenji Ono & Kazuo Minami
Graduate School of System Informatics, Kobe University, Kobe, Japan
Kenji Ono
Institute of Industrial Science, University of Tokyo, Bunkyō, Japan
Kenji Ono
Fujitsu Limited, Tokyo, Japan
Shuichi Chiba & Shunsuke Inoue

Corresponding author

Correspondence to Kenji Ono.

Editor information

Editors and Affiliations

IRIT, ENSEEIHT, Toulouse Cedex, France
Michel Daydé
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Osni Marques
Information Technology Center, The University of Tokyo, Tokyo, Japan
Kengo Nakajima

About this paper

Cite this paper

Ono, K., Chiba, S., Inoue, S., Minami, K. (2015). Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_17

DOI: https://doi.org/10.1007/978-3-319-17353-5_17
Published: 18 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17352-8
Online ISBN: 978-3-319-17353-5
eBook Packages: Computer Science, Computer Science (R0)
