KFusion: optimizing data flow without compromising modularity

Authors:

Aaron Gulliver,

Yvonne Coady

AOSD '13: Proceedings of the 12th annual international conference on Aspect-oriented software development

Pages 25 - 36

https://doi.org/10.1145/2451436.2451440

Published: 24 March 2013

Abstract

Programming language support for multi-core architectures introduces a fundamentally new mechanism for modularity: a kernel. Though it can be used as a means to separate concerns, a kernel is given a clean slate of memory at execution time. As a consequence, application developers attempting to leverage libraries of kernels often incur substantial unanticipated performance penalties. Currently, the only recourse is to compromise modularity for the sake of optimizing data flow on an application-specific basis.

KFusion is our prototype tool for optimizing libraries of kernels according to application-specific needs. Our goal is to shield application developers from loop fusion and deforestation in compositions of low-level kernels that share data. Libraries, augmented by domain experts with annotations to ensure correct compositions of kernels, provide application developers with the opportunity to supply hints according to customized data flow needs, keeping modularity intact. In the worst case, an inaccurate hint incurs no penalty. Case studies of applications using general-purpose libraries for linear algebra, image manipulation and physics engines show that KFusion can substantially improve performance associated with memory bandwidth bottlenecks.
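
The data-flow cost described in the abstract can be illustrated with a minimal sketch. It is not KFusion's interface or output: it uses plain Python functions as stand-ins for data-parallel kernels, and the names `scale`, `add`, `scale_then_add_unfused`, and `scale_add_fused` are hypothetical. The modular composition writes a full intermediate buffer and reads it back, while the fused version computes the same result in one pass.

```python
# Illustrative sketch only: Python functions standing in for library kernels.
# All names here are hypothetical and are not taken from KFusion.

def scale(xs, a):
    """Kernel 1: read every element, write a * x into a new buffer."""
    return [a * x for x in xs]


def add(xs, ys):
    """Kernel 2: read both inputs, write the element-wise sum into a new buffer."""
    return [x + y for x, y in zip(xs, ys)]


def scale_then_add_unfused(xs, ys, a):
    """Modular composition: the intermediate buffer is written in full,
    then read back in full by the second kernel."""
    tmp = scale(xs, a)   # n writes to an intermediate buffer
    return add(tmp, ys)  # n additional reads of that buffer


def scale_add_fused(xs, ys, a):
    """Fused composition: a single pass over the data, so the intermediate
    value a * x is consumed immediately and never stored in a buffer."""
    return [a * x + y for x, y in zip(xs, ys)]


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
unfused = scale_then_add_unfused(xs, ys, 2.0)
fused = scale_add_fused(xs, ys, 2.0)
```

Both versions return the same values; the difference is only in how much data moves through memory. Writing `scale_add_fused` by hand is the kind of modularity compromise the abstract says application developers currently have to make.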

Cited By

Potter R, Keir P, Bradford R, Murray A, McIntosh-Smith S, Bergen B (2015). Kernel composition in SYCL. Proceedings of the 3rd International Workshop on OpenCL, pages 1-7. doi:10.1145/2791321.2791332. Online publication date: 12 May 2015.
https://dl.acm.org/doi/10.1145/2791321.2791332

Index Terms

KFusion: optimizing data flow without compromising modularity
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

AOSD '13: Proceedings of the 12th annual international conference on Aspect-oriented software development

March 2013

232 pages

ISBN:9781450317665

DOI:10.1145/2451436

General Chair: Hidehiko Masuhara (The University of Tokyo, Japan)

Program Chairs: Jörg Kienzle (McGill University, Canada), Elisa Baniassad (Australian National University, Australia), David H. Lorenz (The Open University of Israel, Israel)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

AOSA: Aspect-Oriented Software Association

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

AOSD '13

Sponsor:

AOSA

AOSD '13: Aspect-Oriented Software Development

March 24 - 29, 2013

Fukuoka, Japan

Acceptance Rates

Overall Acceptance Rate 41 of 139 submissions, 29%
