Skip to main content

Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations

  • Conference paper
  • First Online:
High Performance Computing for Computational Science -- VECPAR 2014 (VECPAR 2014)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8969))

Abstract

Practical simulators require high-performance iterative methods and efficient boundary conditions, especially in the field of computational fluid dynamics. In this paper, we propose a novel bit-representation technique to enhance the performance of such simulators. The technique is applied to an iterative kernel implementation that treats various boundary conditions in a stencil computation on a structured grid system. This approach reduces traffic from the main memory to CPU, and effectively utilizes Single Instruction–Multiple Data (SIMD) stream units with cache because of the bit-representation and compression of matrix elements. The proposed implementation also replaces if-branch statements with mask operations using the bit expression. This promotes the optimization of code during compilation and runtime. To evaluate the performance of the proposed implementation, we employ the Red–Black SOR and BiCGstab algorithms. Experimental results show that the proposed approach is up to 3.5 times faster than a naïve implementation on both the Intel and Fujitsu Sparc architectures.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    Here, we count 1 flop for multiplication, addition, and subtraction operators, and 8 flops for the division operator.

  2. 2.

    The number of loads depends on the size of the cache line. In this case, there are 3 loads for array p owing to the L3 cache.

References

  1. Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)

    Article  Google Scholar 

  2. Willcock, J., Lumsdaine, A.: Accelerating sparse matrix computations via data compression. In: Proceedings of the 20th Annual ICS 2006, pp. 307–316 (2006)

    Google Scholar 

  3. Tang, W.T., et al.: Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. In: Proceedings of SC 2013, vol. 26, pp. 1–12 (2013)

    Google Scholar 

  4. Van der Vorst, H.A.: Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 13(2), 631–644 (1992)

    Article  MATH  Google Scholar 

  5. Yokokawa, M.: Vector-parallel processing of the successive overrelaxation method. Japan Atomic Energy Research Institute JAERI-M Report No. 88–017 (1988) (in Japanese)

    Google Scholar 

  6. Ono, K., Kawashima, Y., Kawanabe, T.: Data centric framework for large-scale high-performance parallel computation. Procedia Comput. Sci. 29, 2336–2350 (2014)

    Article  Google Scholar 

  7. http://avr-aics-riken.github.io/ffvc_package/

  8. http://avr-aics-riken.github.io/PMlib/

  9. http://www.cs.virginia.edu/stream

Download references

Acknowledgments

We thank the RIKEN Advanced Institute for Computational Science for allowing us to use the K computer to obtain our results. Part of this research was supported by a grant for the “Strategic Program on HPCI Field No. 4: Industrial Innovations” from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) “Development and Use of Advanced, High-Performance, General-Purpose Supercomputers Project,” and was carried out in partnership with the University of Tokyo.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kenji Ono .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ono, K., Chiba, S., Inoue, S., Minami, K. (2015). Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science(), vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-17353-5_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17352-8

  • Online ISBN: 978-3-319-17353-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics