
Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction

Published in: International Journal of Parallel Programming

Abstract

Interpreters are well researched in the fields of compiler construction and program generation. They are typically used to execute programs of different programming languages without a compilation step. However, they can also model complex rule-based simulations: the interpreter applies all rules one after another, and these rules can be applied iteratively to a globally updated state to obtain the final simulation result. Many simulations for domain-specific problems already leverage the parallel processing capabilities of Graphics Processing Units (GPUs); they rely on hardware-tuned rule implementations to achieve maximum performance. However, every interpreter-based system requires a high-level algorithm that detects active rules and determines when they are evaluated. A common approach in this context is to use a different interpreter routine for every problem domain. Executing such functions efficiently mainly involves dealing with hardware peculiarities such as thread divergence, ALU computations, and memory operations. Furthermore, the interpreter is nowadays often executed on multiple states in parallel, which is particularly important for heuristic search and what-if analyses. In this paper, we present a novel and easy-to-implement method based on thread compaction to realize generic rule-based interpreters efficiently on GPUs. It is optimized for many states using a specially designed memory layout. Benchmarks on our evaluation scenarios show that performance can be increased significantly in comparison to commonly used existing implementations.
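The abstract describes an interpreter that applies all rules, one after another, iteratively to a globally updated state, and that is executed on many states in parallel. The following is a minimal host-side sketch of that execution model only; the rule, the state fields, and all function names are hypothetical and do not come from the paper, whose actual contribution is a GPU implementation of this loop.

```python
# Illustrative model (not the paper's GPU implementation): a rule-based
# interpreter that applies every rule, one after another, to a set of
# independent states until no rule is active or an iteration limit is hit.

def rule_increment(state):
    # Hypothetical rule: active while the counter is below its limit.
    if state["counter"] < state["limit"]:
        state["counter"] += 1
        return True   # rule was active this pass
    return False

def interpret(states, rules, max_iterations=10):
    """Apply every rule to every state; stop early once no rule fires."""
    for _ in range(max_iterations):
        any_active = False
        for state in states:      # on a GPU, states are processed in parallel
            for rule in rules:    # rules are applied one after another
                any_active |= rule(state)
        if not any_active:
            break
    return states

states = [{"counter": 0, "limit": 3}, {"counter": 0, "limit": 7}]
interpret(states, [rule_increment])
print([s["counter"] for s in states])  # → [3, 7]
```

On a GPU, the inner loop over states is what the paper parallelizes; the divergence problem arises because neighboring threads may disagree on which rules are active, which is what thread compaction addresses.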


Notes

  1. A single SIMT unit is referred to as a warp in the scope of this paper. Every warp is visualized as eight lanes (threads), and groups are visualized as three warps.

  2. This matches the common bank-size configuration of 4 bytes on current GPUs.
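To make Note 1 concrete, here is a small host-side model of warp-level thread compaction with eight lanes per warp, as in the paper's visualizations. This is an illustrative sketch, not the paper's implementation: on a real GPU the same result is typically obtained with a warp ballot followed by a prefix sum over the activity mask.

```python
WARP_SIZE = 8  # lanes per warp, matching the visualization in Note 1

def compact(active_mask):
    """Stream-compact the indices of active lanes to the front of the warp.

    active_mask is a list of WARP_SIZE booleans indicating which lanes
    want to execute work. The returned list gives the lane indices packed
    densely, so a warp can process them without divergent idle lanes.
    """
    compacted = [lane for lane in range(WARP_SIZE) if active_mask[lane]]
    # Each active lane's new position equals the number of active lanes
    # with a smaller index, i.e. an exclusive prefix sum over the mask.
    positions, count = {}, 0
    for lane in range(WARP_SIZE):
        if active_mask[lane]:
            positions[lane] = count
            count += 1
    assert sorted(positions, key=positions.get) == compacted
    return compacted

# Lanes 1, 3, 4 and 6 want to execute a rule; the rest are idle.
print(compact([False, True, False, True, True, False, True, False]))
# → [1, 3, 4, 6]
```

Without compaction, the example warp above would execute the rule with four of its eight lanes masked off; after compaction the active work items occupy consecutive lanes, which is the utilization gain the paper builds on.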



Acknowledgements

The authors would like to thank Wladimir Panfilenko and Thomas Schmeyer for their suggestions and feedback regarding our method. Furthermore, we would like to thank Gian-Luca Kiefer for additional feedback on the paper. This work was funded by the German Ministry of Education and Research: Project Hybr-iT: Hybrid and Intelligent Human–Robot Collaboration (Grant Number 01IS16026A).

Author information


Corresponding author

Correspondence to M. Köster.



Cite this article

Köster, M., Groß, J. & Krüger, A. Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction. Int J Parallel Prog 48, 675–691 (2020). https://doi.org/10.1007/s10766-020-00670-2

