The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation

  • Conference paper
Parallel Computing Technologies (PaCT 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10421)

Abstract

An algorithm from the LRnLA family, DiamondTetris, is constructed for stencil computation. It targets Many Integrated Core processors of the Xeon Phi family. The algorithm and its implementation are described for a simulation based on the wave equation. Its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization: only one vector rearrangement operation is needed per cell value update. Performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.
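As a rough illustration of two ideas named in the abstract, the sketch below shows the standard roofline bound (attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity) and a naive one-dimensional leapfrog stencil update for the wave equation. It is not the DiamondTetris traversal and not the authors' scheme: the function names `roofline` and `wave_step`, the 1D three-point stencil, the fixed zero boundaries, and all numeric inputs are assumptions chosen only for illustration.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Attainable performance (GFLOP/s) under the roofline model.

    Performance is capped either by peak compute or by how fast memory
    can feed the arithmetic units at the given operational intensity.
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


def wave_step(u_prev, u_curr, courant2):
    """One naive leapfrog update of the 1D wave equation.

    courant2 is (c * dt / dx) ** 2. The end points are held at zero
    (an assumed boundary condition, for illustration only).
    """
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        # Three-point stencil: second difference in space.
        laplacian = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + courant2 * laplacian
    return u_next


if __name__ == "__main__":
    # Hypothetical machine numbers, not measurements from the paper.
    print(roofline(1000.0, 100.0, 0.5))
    print(wave_step([0.0] * 5, [0.0, 0.0, 1.0, 0.0, 0.0], 1.0))
```

A naive sweep like `wave_step` reads every cell from memory once per time step, so its operational intensity is low and the roofline bound is set by bandwidth. Time-space tiling algorithms such as the one described here aim to raise that intensity by reusing data across several time steps while it remains in fast memory.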

Acknowledgments

The access to the computing resources with Intel Xeon Phi KNL has been provided by Colfax Research (colfaxresearch.com) in the course of “Deep Dive” HOW series.

Author information

Corresponding author

Correspondence to Anastasia Perepelkina.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Levchenko, V., Perepelkina, A. (2017). The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_11

  • DOI: https://doi.org/10.1007/978-3-319-62932-2_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62931-5

  • Online ISBN: 978-3-319-62932-2

  • eBook Packages: Computer Science, Computer Science (R0)
