Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading

Shrestha, Sunil; Manzano, Joseph; Marquez, Andres; Feo, John; Gao, Guang R.

doi:10.1007/978-3-319-17473-0_11


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8967)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

In this paper, we develop a novel methodology that takes multithreaded many-core designs into consideration to better utilize memory and processing resources and to improve memory residence in tileable applications. It takes advantage of polyhedral analysis and transformation in the form of PLUTO [6], combined with a highly optimized fine-grain tile runtime, to exploit parallelism at all levels. The main contributions of this paper are the introduction of multi-hierarchical tiling techniques that increase intra-tile parallelism, and a data-flow-inspired runtime library that allows the expression of parallel tiles with an efficient synchronization registry. Our current implementation shows performance improvements of up to 32.25% on an Intel Xeon Phi board against instances produced by state-of-the-art compiler frameworks for selected stencil applications.

Notes

1. Where $m$ is less than or equal to the number of dimensions of the iteration space.
2. The parallel hyperplane.
3. Where $n$ is the size of a dimension in the iteration space. For our example, both dimensions have the same size.

References

1. perf: Linux profiling with performance counters
2. Bandishti, V., Pananilath, I., Bondhugula, U.: Tiling stencil computations to maximize parallelism. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, Los Alamitos, CA, USA, pp. 40:1–40:11 (2012)
3. Baskaran, M.M., et al.: Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1–10. ACM (2008)
4. Bastoul, C.: Generating loops for scanning polyhedra: CLooG user's guide. Polyhedron 2, 10 (2004)
5. Bikshandi, G., et al.: Programming for parallelism and locality with hierarchically tiled arrays. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2006, pp. 48–57. ACM, New York (2006)
6. Bondhugula, U., Ramanujam, J.: Pluto: a practical and fully automatic polyhedral parallelizer and locality optimizer (2007)
7. Intel Open Source Technology Center: Open Community Runtime (2012)
8. Cepeda, S.: Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: understanding and using hardware events (2012)
9. Datta, K., Kamil, S., Williams, S., Oliker, L., Shalf, J., Yelick, K.: Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Rev. (2008)
10. Dursun, H., et al.: Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters. J. Supercomput. 62(2), 946–966 (2012)
11. Feautrier, P.: Some efficient solutions to the affine scheduling problem. I. One-dimensional time. Int. J. Parallel Program. 21(5), 313–347 (1992)
12. Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int. J. Parallel Program. 21(6), 389–420 (1992)
13. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, p. 285. IEEE Computer Society, Washington, DC (1999)
14. Gan, G., Wang, X., Manzano, J., Gao, G.R.: Tile percolation: an OpenMP tile aware parallelization technique for the Cyclops-64 multicore processor. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 839–850. Springer, Heidelberg (2009)
15. Griebl, M., Lengauer, C., Wetzel, S.: Code generation in the polytope model. In: Proceedings 1998 International Conference on Parallel Architectures and Compilation Techniques, pp. 106–111. IEEE (1998)
16. Grosser, T., Verdoolaege, S., Cohen, A., Sadayappan, P.: The relation between diamond tiling and hexagonal tiling. In: HiStencils 2014, p. 65 (2014)
17. Högstedt, K., Carter, L., Ferrante, J.: Selecting tile shape for minimal execution time. In: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 201–211. ACM (1999)
18. ET International: SWARM (SWift Adaptive Runtime Machine) (2012)
19. Kim, D., et al.: Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO 2004, p. 27. IEEE Computer Society, Washington, DC (2004)
20. Kodukula, I., Ahmed, N., Pingali, K.: Data-centric multi-level blocking, pp. 346–357 (1997)
21. Lewis, J., et al.: An automatic prefetching and caching system. In: 2010 IEEE 29th International Performance Computing and Communications Conference (IPCCC), pp. 180–187, December 2010
22. Tanguay, D.O.J.: Compile-time Loop Splitting for Distributed Memory Multiprocessors. Massachusetts Institute of Technology, Laboratory for Computer Science, Department of Electrical Engineering and Computer Science (1993)
23. Theobald, K.B.: EARTH: An Efficient Architecture for Running Threads. McGill University, Montreal (1999)
24. Wilde, D.K.: A library for doing polyhedral operations. Technical report (1997)
25. Wolf, M.E., Lam, M.S.: A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI 1991, pp. 30–44. ACM, New York (1991)
26. Wolfe, M.: More iteration space tiling. In: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Supercomputing 1989, pp. 655–664. ACM, New York (1989)
27. Wolfe, M.: Iteration space tiling for memory hierarchies. In: Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pp. 357–361. Society for Industrial and Applied Mathematics, Philadelphia (1989)

Author information

Authors and Affiliations

CAPSL, University of Delaware, Newark, DE, 19716, USA
Sunil Shrestha & Guang R. Gao
Pacific Northwest National Laboratory, Richland, WA, 99354, USA
Joseph Manzano, Andres Marquez & John Feo

Corresponding author

Correspondence to Sunil Shrestha.

Editor information

Editors and Affiliations

Intel Corporation, Santa Clara, California, USA
James Brodman
Intel Corporation, Santa Clara, California, USA
Peng Tu

About this paper

Cite this paper

Shrestha, S., Manzano, J., Marquez, A., Feo, J., Gao, G.R. (2015). Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_11

DOI: https://doi.org/10.1007/978-3-319-17473-0_11
Published: 01 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science, Computer Science (R0)
