
Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction

Published in: International Journal of Parallel Programming

Abstract

Interpreters are well researched in the fields of compiler construction and program generation. They are typically used to execute programs of different programming languages without a compilation step. However, they can also model complex rule-based simulations: the interpreter applies all rules one after another, and these rules can be applied iteratively to a globally updated state to obtain the final simulation result. Many simulations for domain-specific problems already leverage the parallel processing capabilities of Graphics Processing Units (GPUs); they rely on hardware-tuned rule implementations to achieve maximum performance. However, every interpreter-based system requires a high-level algorithm that detects active rules and determines when they are evaluated. A common approach in this context is to use a different interpreter routine for every problem domain. Executing such functions efficiently mainly involves dealing with hardware peculiarities such as thread divergence, ALU computations, and memory operations. Furthermore, the interpreter is nowadays often executed on multiple states in parallel, which is particularly important for heuristic search and what-if analyses. In this paper, we present a novel and easy-to-implement method based on thread compaction to realize generic rule-based interpreters efficiently on GPUs. It is optimized for many states using a specially designed memory layout. Benchmarks on our evaluation scenarios show that performance can be increased significantly in comparison to commonly used existing implementations.
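The abstract describes an interpreter that applies all rules, one after another, iteratively to a globally updated state, and that is executed on many states in parallel. The following is a minimal host-side sketch of that execution model only; the rule, the state fields, and all function names are hypothetical and do not come from the paper, whose actual contribution is a GPU implementation of this loop.

```python
# Illustrative model (not the paper's GPU implementation): a rule-based
# interpreter that applies every rule, one after another, to a set of
# independent states until no rule is active or an iteration limit is hit.

def rule_increment(state):
    # Hypothetical rule: active while the counter is below its limit.
    if state["counter"] < state["limit"]:
        state["counter"] += 1
        return True   # rule was active this pass
    return False

def interpret(states, rules, max_iterations=10):
    """Apply every rule to every state; stop early once no rule fires."""
    for _ in range(max_iterations):
        any_active = False
        for state in states:      # on a GPU, states are processed in parallel
            for rule in rules:    # rules are applied one after another
                any_active |= rule(state)
        if not any_active:
            break
    return states

states = [{"counter": 0, "limit": 3}, {"counter": 0, "limit": 7}]
interpret(states, [rule_increment])
print([s["counter"] for s in states])  # → [3, 7]
```

On a GPU, the inner loop over states is what the paper parallelizes; the divergence problem arises because neighboring threads may disagree on which rules are active, which is what thread compaction addresses.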


Notes

  1. A single SIMT unit is referred to as a warp in the scope of this paper. Every warp is visualized as eight lanes (threads), and groups are visualized as three warps.

  2. This matches the common bank-size configuration of 4 bytes on current GPUs.
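To make Note 1 concrete, here is a small host-side model of warp-level thread compaction with eight lanes per warp, as in the paper's visualizations. This is an illustrative sketch, not the paper's implementation: on a real GPU the same result is typically obtained with a warp ballot followed by a prefix sum over the activity mask.

```python
WARP_SIZE = 8  # lanes per warp, matching the visualization in Note 1

def compact(active_mask):
    """Stream-compact the indices of active lanes to the front of the warp.

    active_mask is a list of WARP_SIZE booleans indicating which lanes
    want to execute work. The returned list gives the lane indices packed
    densely, so a warp can process them without divergent idle lanes.
    """
    compacted = [lane for lane in range(WARP_SIZE) if active_mask[lane]]
    # Each active lane's new position equals the number of active lanes
    # with a smaller index, i.e. an exclusive prefix sum over the mask.
    positions, count = {}, 0
    for lane in range(WARP_SIZE):
        if active_mask[lane]:
            positions[lane] = count
            count += 1
    assert sorted(positions, key=positions.get) == compacted
    return compacted

# Lanes 1, 3, 4 and 6 want to execute a rule; the rest are idle.
print(compact([False, True, False, True, True, False, True, False]))
# → [1, 3, 4, 6]
```

Without compaction, the example warp above would execute the rule with four of its eight lanes masked off; after compaction the active work items occupy consecutive lanes, which is the utilization gain the paper builds on.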



Acknowledgements

The authors would like to thank Wladimir Panfilenko and Thomas Schmeyer for their suggestions and feedback regarding our method. Furthermore, we would like to thank Gian-Luca Kiefer for additional feedback on the paper. This work was funded by the German Ministry of Education and Research: Project Hybr-iT: Hybrid and Intelligent Human–Robot Collaboration (Grant Number 01IS16026A).

Author information


Corresponding author

Correspondence to M. Köster.



Cite this article

Köster, M., Groß, J. & Krüger, A. Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction. Int J Parallel Prog 48, 675–691 (2020). https://doi.org/10.1007/s10766-020-00670-2

