Abstract
This paper presents a novel methodology that targets multithreaded many-core designs to make better use of memory and processing resources and to improve memory residence in tileable applications. It combines polyhedral analysis and transformation, in the form of PLUTO [6], with a highly optimized fine-grain tile runtime to exploit parallelism at all levels. The main contributions are multi-hierarchical tiling techniques that increase intra-tile parallelism, and a data-flow-inspired runtime library that allows the expression of parallel tiles with an efficient synchronization registry. Our current implementation shows performance improvements of up to 32.25% on an Intel Xeon Phi board against instances produced by state-of-the-art compiler frameworks for selected stencil applications.
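For readers unfamiliar with the term "tileable", the following is a minimal sketch of plain rectangular tiling applied to one sweep of a 5-point Jacobi stencil. It is not the jagged tiling scheme of this paper, nor code generated by PLUTO; the function names, the tile size, and the choice of stencil are assumptions made purely for illustration.

```python
def jacobi_sweep(grid):
    """One untiled 5-point Jacobi sweep over the interior of a square grid."""
    n = len(grid)
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def jacobi_sweep_tiled(grid, tile=4):
    """The same sweep, with the interior visited tile by tile.

    `tile` is a hypothetical tile edge length. Within a single sweep each
    tile reads only `grid` and writes only its own part of `out`, so the
    tiles are independent of one another.
    """
    n = len(grid)
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, tile):          # loops over tile origins
        for jj in range(1, n - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):      # loops inside a tile
                for j in range(jj, min(jj + tile, n - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


if __name__ == "__main__":
    demo = [[float((i * j) % 7) for j in range(10)] for i in range(10)]
    print(jacobi_sweep(demo) == jacobi_sweep_tiled(demo, tile=3))
```

Tiling only changes the order in which points are visited, so both functions return the same grid. The working set of a tile can be sized to fit a given level of the memory hierarchy, which is the locality benefit that tiling approaches in general aim for.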
Notes
1. Where \(m\) is less than or equal to the number of dimensions of the iteration space.
2. The parallel hyperplane.
3. Where \(n\) is the size of a dimension in the iteration space. For our example, both dimensions have the same size.
References
perf: Linux profiling with performance counters
Bandishti, V., Pananilath, I., Bondhugula, U.: Tiling stencil computations to maximize parallelism. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, Los Alamitos, CA, USA, pp. 40:1–40:11 (2012)
Baskaran, M.M., et al.: Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1–10. ACM (2008)
Bastoul, C.: Generating loops for scanning polyhedra: CLooG user's guide. Polyhedron 2, 10 (2004)
Bikshandi, G., et al.: Programming for parallelism and locality with hierarchically tiled arrays. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2006, pp. 48–57. ACM, New York (2006)
Bondhugula, U., Ramanujam, J.: Pluto: a practical and fully automatic polyhedral parallelizer and locality optimizer (2007)
Intel Open Source Technology Center. Open community runtime (2012)
Cepeda, S.: Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: understanding and using hardware events (2012)
Datta, K., Kamil, S., Williams, S., Oliker, L., Shalf, J., Yelick, K.: Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Rev. (2008)
Dursun, H., et al.: Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters. J. Supercomput. 62(2), 946–966 (2012)
Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time. Int. J. Parallel Program. 21(5), 313–347 (1992)
Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int. J. Parallel Program. 21(6), 389–420 (1992)
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, p. 285. IEEE Computer Society, Washington, DC (1999)
Gan, G., Wang, X., Manzano, J., Gao, G.R.: Tile percolation: an OpenMP tile aware parallelization technique for the Cyclops-64 multicore processor. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 839–850. Springer, Heidelberg (2009)
Griebl, M., Lengauer, C., Wetzel, S.: Code generation in the polytope model. In: Proceedings 1998 International Conference on Parallel Architectures and Compilation Techniques, pp. 106–111. IEEE (1998)
Grosser, T., Verdoolaege, S., Cohen, A., Sadayappan, P.: The relation between diamond tiling and hexagonal tiling. In: HiStencils 2014, p. 65 (2014)
Högstedt, K., Carter, L., Ferrante, J.: Selecting tile shape for minimal execution time. In: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 201–211. ACM (1999)
ET International. SWARM (SWift Adaptive Runtime Machine) (2012)
Kim, D., et al.: Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO 2004, p. 27. IEEE Computer Society, Washington, DC (2004)
Kodukula, I., Ahmed, N., Pingali, K.: Data-centric multi-level blocking, pp. 346–357 (1997)
Lewis, J., et al.: An automatic prefetching and caching system. In: 2010 IEEE 29th International Performance Computing and Communications Conference (IPCCC), pp. 180–187, December 2010
Tanguay, D.O.J.: Compile-time Loop Splitting for Distributed Memory Multiprocessors. Massachusetts Institute of Technology, Laboratory for Computer Science, Department of Electrical Engineering and Computer Science (1993)
Theobald, K.B.: EARTH: An Efficient Architecture for Running Threads. McGill University, Montreal (1999)
Wilde, D.K.: A library for doing polyhedral operations, Technical report (1997)
Wolf, M.E., Lam, M.S.: A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI 1991, pp. 30–44. ACM, New York (1991)
Wolfe, M.: More iteration space tiling. In: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Supercomputing 1989, pp. 655–664. ACM, New York (1989)
Wolfe, M.: Iteration space tiling for memory hierarchies. In: Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pp. 357–361. Society for Industrial and Applied Mathematics, Philadelphia (1989)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Shrestha, S., Manzano, J., Marquez, A., Feo, J., Gao, G.R. (2015). Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_11
DOI: https://doi.org/10.1007/978-3-319-17473-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science (R0)