DOI: 10.1145/3385412.3385989
Automated derivation of parametric data movement lower bounds for affine programs

Published: 11 June 2020

Abstract

Researchers and practitioners have long worked on improving the computational complexity of algorithms, focusing on reducing the number of operations needed to perform a computation. However, current hardware trends clearly show that data movement costs more than computation in both performance and energy: high-quality algorithms must minimize data movement as much as possible.
The theoretical operational complexity of an algorithm is a function of the total number of operations that must be executed, regardless of the order in which they are actually executed. Theoretical data movement (or I/O) complexity is fundamentally different: one must consider all possible legal schedules of the operations to determine the minimal number of data movements achievable, a major theoretical challenge. I/O complexity has been studied via complex manual proofs; for example, the lower bound for matrix multiplication with a cache of size S was refined from Ω(n³/√S) by Hong & Kung to 2n³/√S by Smith et al. While asymptotic complexity may be sufficient to compare the I/O potential of broadly different algorithms, the accuracy of the reasoning depends on the tightness of these I/O lower bounds. In particular, exposing constants is essential for precise comparison between algorithms: for example, the 2n³/√S lower bound makes it possible to demonstrate the optimality of panel-panel tiling for matrix multiplication.
We present the first static analysis to automatically derive non-asymptotic parametric expressions of data movement lower bounds with scaling constants, for arbitrary affine computations. Our approach is fully automatic, helping algorithm designers reason about I/O complexity and make educated decisions about algorithmic alternatives.
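The abstract's contrast between operation count and data movement can be illustrated with a small simulation. The following is a minimal sketch, not the paper's analysis: it assumes a fully associative LRU cache of S array elements as a stand-in for the two-level memory model, counts only loads (cache misses), and uses hypothetical helper names (`LRUCache`, `naive_schedule`, `tiled_schedule`, `count_loads`). It runs the same n³ multiply-add operations of matrix multiplication under two legal schedules and prints the leading term 2n³/√S quoted in the abstract for reference only.

```python
from collections import OrderedDict
from math import sqrt


class LRUCache:
    """Fully associative LRU cache holding `capacity` array elements (an assumed model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def touch(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)      # hit: mark as most recently used
        else:
            self.misses += 1                 # miss: one element loaded from slow memory
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used element


def naive_schedule(n):
    """Textbook i-j-k order of the n^3 multiply-add operations."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield i, j, k


def tiled_schedule(n, b):
    """Same operations, reordered into b x b x b tiles."""
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        for k in range(k0, min(k0 + b, n)):
                            yield i, j, k


def count_loads(schedule, cache_size):
    """Return (operations executed, elements loaded) for C[i][j] += A[i][k] * B[k][j]."""
    cache = LRUCache(cache_size)
    ops = 0
    for i, j, k in schedule:
        cache.touch(("A", i, k))
        cache.touch(("B", k, j))
        cache.touch(("C", i, j))
        ops += 1
    return ops, cache.misses


n, S = 16, 48                      # small illustrative sizes; S = 3 * 4 * 4 fits three 4x4 tiles
naive_ops, naive_loads = count_loads(naive_schedule(n), S)
tiled_ops, tiled_loads = count_loads(tiled_schedule(n, 4), S)
leading_term = 2 * n**3 / sqrt(S)  # leading term of the bound quoted in the abstract

print(f"operations: naive={naive_ops}, tiled={tiled_ops}")
print(f"loads:      naive={naive_loads}, tiled={tiled_loads}")
print(f"2n^3/sqrt(S) = {leading_term:.1f}")
```

Both schedules execute the same number of operations, yet the tiled order loads fewer elements. That gap is why a data movement lower bound has to hold over every legal schedule rather than one particular loop order. The printed leading term is shown for scale only; the simulation's cost model (loads under LRU) is an assumption and is not claimed to match the model in which the bound was proved.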

References

[1] Laksono Adhianto, S. Banerjee, Michael W. Fagan, Mark Krentel, Gabriel Marin, John M. Mellor-Crummey, and Nathan R. Tallent. 2010. HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685–701.
[2] Alok Aggarwal and Jeffrey S. Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM 31 (1988), 1116–1127.
[3] Grey Ballard, Erin Carson, James Demmel, Mark Hoemmen, Nick Knight, and Oded Schwartz. 2014. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23 (2014), 1–155.
[4] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications 32, 3 (2011), 866–901.
[5] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. J. ACM 59, 6 (2012), 32.
[6] Alexander I. Barvinok. 1994. A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research 19, 4 (1994), 769–779.
[7] Christian Bauer, Alexander Frink, and Richard Kreckel. 2002. Introduction to the GiNaC Framework for Symbolic Computation within the C++ Programming Language. J. Symbolic Computation 33 (2002), 1–12.
[8] Gianfranco Bilardi and Enoch Peserico. 2001. A characterization of temporal locality and its portability across memory hierarchies. Automata, Languages and Programming (2001), 128–139.
[9] Gianfranco Bilardi, Michele Scquizzato, and Francesco Silvestri. 2012. A Lower Bound Technique for Communication on BSP with Application to the FFT. In Euro-Par 2012 Parallel Processing - 18th International Conference, Euro-Par 2012, Rhodes Island, Greece, August 27-31, 2012. Proceedings. 676–687.
[10] Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, and Katherine Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays - Part 1. EECS Technical Report EECS-2013-61. UC Berkeley.
[11] James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal Parallel and Sequential QR and LU Factorizations. SIAM J. Scientific Computing 34, 1 (2012), A206–A239.
[12] Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2014. On characterizing the data movement complexity of computational DAGs for parallel execution. In Proc. of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’14, Prague, Czech Republic - June 23 - 25, 2014. 296–306.
[13] Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2015. On Characterizing the Data Access Complexity of Programs. In Proc. of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015. 567–580.
[14] Paul Feautrier. 1988. Parametric integer programming. RAIRO Recherche Opérationnelle 22, 3 (1988), 243–268.
[15] Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. I. One-dimensional time. International Journal of Parallel Programming 21, 5 (1992), 313–347.
[16] Paul Feautrier and Christian Lengauer. 2011. Polyhedron model. In Encyclopedia of Parallel Computing. 1581–1592.
[17] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 1999. Cache-Oblivious Algorithms. In Proc. of the 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, 17-18 October, 1999, New York, NY, USA. 285–298.
[18] Jia-Wei Hong and H. T. Kung. 1981. I/O complexity: The red-blue pebble game. In Proc. of the 13th Annual ACM Symposium on Theory of Computing (STOC ’81), May 11-13, 1981, Milwaukee, Wisconsin, USA. 326–333.
[19] Dror Irony, Sivan Toledo, and Alexandre Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel and Distrib. Comput. 64, 9 (2004), 1017–1026.
[20] Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019, Denver, Colorado, USA, November 17-19, 2019. 24:1–24:22.
[21] Lynn H. Loomis and Hassler Whitney. 1949. An inequality related to the isoperimetric inequality. Bull. Am. Math. Soc. 55 (1949), 961–962.
[22] Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2019. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. arXiv: cs.CC/1911.06664
[23] Louis-Noël Pouchet and Tomofumi Yuki. 2015. PolyBench/C 4.2. http://polybench.sf.net/.
[24] J. Ramanujam and P. Sadayappan. 1992. Tiling multidimensional iteration spaces for multicomputers. J. Parallel and Distrib. Comput. 16, 2 (1992), 108–230.
[25] Desh Ranjan, John E. Savage, and Mohammad Zubair. 2010. Upper and Lower I/O Bounds for Pebbling r-Pyramids. In Combinatorial Algorithms - 21st International Workshop, IWOCA 2010, London, UK, July 26-28, 2010, Revised Selected Papers. 107–120.
[26] Desh Ranjan, John E. Savage, and Mohammad Zubair. 2011. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs. In Computing and Combinatorics. LNCS, Vol. 6842. 134–145.
[27] Desh Ranjan, John E. Savage, and Mohammad Zubair. 2012. Upper and lower I/O bounds for pebbling r-pyramids. J. Discrete Algorithms 14 (2012), 2–12.
[28] John E. Savage. 1995. Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics. LNCS, Vol. 959. 270–281.
[29] John E. Savage and Mohammad Zubair. 2008. A unified model for multicore architectures. In Proc. of the 1st international forum on Next-generation multicore/manycore technologies, IFMT 2008, Cairo, Egypt, November 24-25, 2008. 9.
[30] Tyler Michael Smith, Bradley Lowery, Julien Langou, and Robert A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. arXiv: 1702.02017v2
[31] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[32] Sven Verdoolaege. 2010. ISL: An integer set library for the polyhedral model. In Mathematical Software–ICMS 2010. 299–302.
[33] Sven Verdoolaege. 2018. Integer Set Library: Manual. http://isl.gforge.inria.fr/manual.pdf.
[34] Sven Verdoolaege and Tobias Grosser. 2012. Polyhedral Extraction Tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12).
[35] Samuel Williams, Andrew Waterman, and David Patterson. 2009.

Published In

PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2020
1174 pages
ISBN: 9781450376136
DOI: 10.1145/3385412
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Affine programs
  2. Data access complexity
  3. I/O lower bounds
  4. Static analysis

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

PLDI '20

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 181
  • Downloads (last 6 weeks): 30
Reflects downloads up to 05 Mar 2025.

Cited By

  • (2025) An Approach to Tight I/O Lower Bounds for Algorithms with Composite Procedures. Computing and Combinatorics, 152–163. DOI: 10.1007/978-981-96-1093-8_13. Online publication date: 20-Feb-2025
  • (2024) Parallel Loop Locality Analysis for Symbolic Thread Counts. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 219–232. DOI: 10.1145/3656019.3676948. Online publication date: 14-Oct-2024
  • (2024) The Droplet Search Algorithm for Kernel Scheduling. ACM Transactions on Architecture and Code Optimization 21:2, 1–28. DOI: 10.1145/3650109. Online publication date: 21-May-2024
  • (2024) Formal Verification of Source-to-Source Transformations for HLS. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 97–107. DOI: 10.1145/3626202.3637563. Online publication date: 1-Apr-2024
  • (2024) Tightening I/O Lower Bounds through the Hourglass Dependency Pattern. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 183–193. DOI: 10.1145/3626183.3659986. Online publication date: 17-Jun-2024
  • (2024) Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 150–166. DOI: 10.1109/ISCA59077.2024.00021. Online publication date: 29-Jun-2024
  • (2023) Cache Programming for Scientific Loops Using Leases. ACM Transactions on Architecture and Code Optimization 20:3, 1–25. DOI: 10.1145/3600090. Online publication date: 19-Jul-2023
  • (2023) Parallel Memory-Independent Communication Bounds for SYRK. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, 391–401. DOI: 10.1145/3558481.3591072. Online publication date: 17-Jun-2023
  • (2023) Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 390–401. DOI: 10.1109/IPDPS54959.2023.00047. Online publication date: May-2023
  • (2022) Beyond time complexity. Proceedings of the 36th ACM International Conference on Supercomputing, 1–12. DOI: 10.1145/3524059.3532395. Online publication date: 28-Jun-2022
