The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation

Levchenko, Vadim; Perepelkina, Anastasia

doi:10.1007/978-3-319-62932-2_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10421)

Included in the following conference series:

International Conference on Parallel Computing Technologies

Abstract

An algorithm from the LRnLA family, DiamondTetris, is constructed for stencil computation. It targets Many Integrated Core processors of the Xeon Phi family. The algorithm and its implementation are described for wave-equation-based simulation. Its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization. Specifically, only one vector rearrange operation is necessary per cell value update. The performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.
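
As a rough illustration of two ideas named in the abstract, the sketch below shows a naive stencil update for the wave equation and a roofline estimate of attainable performance. It is not the DiamondTetris algorithm: it assumes a 1D grid, a second-order leapfrog scheme, and fixed zero boundaries purely for brevity, and the function names `wave_step` and `roofline` as well as all numeric values are hypothetical.

```python
def roofline(peak_flops, bandwidth, intensity):
    """Attainable performance under the roofline model.

    peak_flops: peak compute rate (FLOP/s)
    bandwidth:  memory bandwidth (bytes/s)
    intensity:  operational intensity (FLOP/byte)
    """
    return min(peak_flops, bandwidth * intensity)


def wave_step(prev, curr, c2):
    """One naive leapfrog update of the 1D wave equation.

    prev, curr: field values at time steps n-1 and n
    c2:         squared Courant number, (c * dt / dx) ** 2 (assumed given)
    Boundary cells are held at zero (an assumption of this sketch).
    """
    nxt = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        laplacian = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + c2 * laplacian
    return nxt


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
low = roofline(1000e9, 100e9, 2.0)    # memory-bound regime
high = roofline(1000e9, 100e9, 50.0)  # compute-bound regime
field = wave_step([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.25)
```

A stepwise loop like this rereads the whole field at every time step, so its operational intensity is low and the estimate stays on the memory-bound slope of the roofline. Algorithms that improve locality aim to raise that intensity.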

Acknowledgments

Access to computing resources with Intel Xeon Phi KNL was provided by Colfax Research (colfaxresearch.com) as part of the “Deep Dive” HOW series.

Author information

Authors and Affiliations

Keldysh Institute of Applied Mathematics RAS, Miusskaya sq., 4, Moscow, Russia
Vadim Levchenko & Anastasia Perepelkina

Corresponding author

Correspondence to Anastasia Perepelkina.

Editor information

Editors and Affiliations

Russian Academy of Sciences, Novosibirsk, Russia
Victor Malyshkin

Cite this paper

Levchenko, V., Perepelkina, A. (2017). The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_11

DOI: https://doi.org/10.1007/978-3-319-62932-2_11
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62931-5
Online ISBN: 978-3-319-62932-2
eBook Packages: Computer Science, Computer Science (R0)
