Abstract
DiamondTetris, a stencil computation algorithm from the LRnLA family, is constructed. It targets Many Integrated Core processors of the Xeon Phi family. The algorithm and its implementation are described for wave-equation-based simulation. Its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization. Specifically, only one vector rearrange operation is necessary per cell value update. The performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.
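As a rough illustration of the roofline estimate mentioned in the abstract, the sketch below computes the attainable performance as the smaller of the peak compute rate and the product of memory bandwidth and operational intensity. The function name and the hardware figures in the usage lines are hypothetical placeholders, not values taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable performance (GFLOP/s) under the basic roofline model.

    peak_gflops    : peak compute rate of the processor, GFLOP/s
    bandwidth_gbs  : sustained memory bandwidth, GB/s
    flops_per_byte : operational intensity of the kernel, FLOP per byte moved
    """
    if peak_gflops < 0 or bandwidth_gbs < 0 or flops_per_byte < 0:
        raise ValueError("all arguments must be non-negative")
    # The kernel is limited either by compute or by memory traffic,
    # whichever ceiling is lower.
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = roofline_gflops(1000.0, 100.0, 2.0)    # low intensity
compute_bound = roofline_gflops(1000.0, 100.0, 50.0)  # high intensity
print(memory_bound, compute_bound)
```

In this picture, an algorithm that improves locality raises the operational intensity and moves a stencil kernel from the bandwidth-limited slope toward the compute-limited plateau.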
Acknowledgments
Access to the computing resources with Intel Xeon Phi KNL was provided by Colfax Research (colfaxresearch.com) in the course of the “Deep Dive” HOW series.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Levchenko, V., Perepelkina, A. (2017). The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_11
DOI: https://doi.org/10.1007/978-3-319-62932-2_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62931-5
Online ISBN: 978-3-319-62932-2
eBook Packages: Computer Science, Computer Science (R0)