DOI: 10.1145/2451436.2451440
research-article

KFusion: optimizing data flow without compromising modularity

Published: 24 March 2013

Abstract

Programming language support for multi-core architectures introduces a fundamentally new mechanism for modularity: the kernel. Although a kernel can be used as a means to separate concerns, it is given a clean slate of memory at execution time. As a consequence, application developers attempting to leverage libraries of kernels often incur substantial, unanticipated performance penalties. Currently, the only recourse is to compromise modularity for the sake of optimizing data flow on an application-specific basis.
KFusion is our prototype tool for optimizing libraries of kernels according to application-specific needs. Our goal is to shield application developers from having to perform loop fusion and deforestation themselves in compositions of low-level kernels that share data. Libraries, augmented by domain experts with annotations that ensure correct compositions of kernels, give application developers the opportunity to supply hints according to their customized data flow needs, keeping modularity intact. In the worst case, an inaccurate hint incurs no penalty. Case studies of applications using general-purpose libraries for linear algebra, image manipulation, and physics engines show that KFusion can substantially improve performance associated with memory bandwidth bottlenecks.
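
To make the data flow problem concrete, the following is a minimal sketch of what fusing two kernels that share data amounts to. It is not KFusion's output or API: the function names (`scale`, `add`, `saxpy_unfused`, `saxpy_fused`) are hypothetical, and plain Python lists stand in for device buffers. The unfused composition writes a full intermediate array that the second kernel then reads back, while the fused version computes the same result in one pass with no intermediate.

```python
# Illustrative sketch only: Python lists stand in for device memory buffers,
# and all function names here are hypothetical, not part of KFusion.

def scale(xs, a):
    """Library 'kernel' 1: multiplies every element by a and writes a new array."""
    return [a * x for x in xs]


def add(xs, ys):
    """Library 'kernel' 2: element-wise sum of two arrays."""
    return [x + y for x, y in zip(xs, ys)]


def saxpy_unfused(a, xs, ys):
    """Modular composition: the intermediate array is fully materialized,
    then read again by the second kernel (two passes over memory)."""
    tmp = scale(xs, a)
    return add(tmp, ys)


def saxpy_fused(a, xs, ys):
    """Hand-fused equivalent: one pass, no intermediate array.
    The two library kernels are no longer separately reusable here."""
    return [a * x + y for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [10.0, 20.0, 30.0]
    print(saxpy_unfused(2.0, xs, ys))
    print(saxpy_fused(2.0, xs, ys))
```

Both functions return the same values; the difference is the extra round trip through memory in the unfused version, which is the cost the abstract attributes to composing kernels modularly. Writing `saxpy_fused` by hand is the application-specific compromise of modularity that the abstract describes as the current recourse.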

Cited By

  • (2015) Kernel composition in SYCL. Proceedings of the 3rd International Workshop on OpenCL, pages 1-7. DOI: 10.1145/2791321.2791332. Online publication date: 12-May-2015

Published In

AOSD '13: Proceedings of the 12th annual international conference on Aspect-oriented software development
March 2013
232 pages
ISBN:9781450317665
DOI:10.1145/2451436
Sponsors

  • AOSA: Aspect-Oriented Software Association

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OpenCL
  2. modularity
  3. parallelism
  4. performance

Qualifiers

  • Research-article

Conference

AOSD '13
Sponsor:
  • AOSA
AOSD '13: Aspect-Oriented Software Development
March 24 - 29, 2013
Fukuoka, Japan

Acceptance Rates

Overall Acceptance Rate 41 of 139 submissions, 29%
