research-article

Coarse-Grained Task Parallelization by Dynamic Profiling for Heterogeneous SoC-Based Embedded System

Published: 10 December 2024

Abstract

In this study, we introduce a methodology, based on dynamic profiling, for automatically transforming user applications written in C/C++ into a parallel representation consisting of coarse-grained tasks. Such a parallel representation is suitable for mapping applications onto heterogeneous SoCs. We present our approach for instrumenting the user application binary during the compilation process with parallel primitives that enable the runtime system to schedule and execute independent, computation-intensive, coarse-grained tasks concurrently. We use the proposed compilation and code transformation methodology to retarget each application for execution on a heterogeneous SoC composed of processor cores and accelerators. We demonstrate the capabilities of our integrated compile-time and runtime flow through task-level parallelization and functionally correct execution of real-world applications in the communication systems and radar processing domains. We validate the integrated system by executing six distinct applications with different degrees of parallelism on four different platforms: an eight-core general-purpose processor, a heterogeneous SoC simulator, and two heterogeneous SoCs utilizing the Xilinx Zynq UltraScale+ FPGA and the Nvidia Jetson AGX board. Our integrated approach offers a path forward for application developers to take full advantage of the target SoC without having to become hardware or parallel programming experts.
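To make the idea of scheduling independent coarse-grained tasks concrete, the following is a minimal sketch in Python, not the authors' toolchain or runtime. It assumes the dependencies between tasks are already known (in the paper they would come from dynamic profiling); the function `run_task_graph`, the task names, and the thread-pool "runtime" are all hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task_graph(tasks, deps, max_workers=4):
    """Run the callables in `tasks`, honoring `deps` (name -> prerequisite names).

    Each task receives the results of its prerequisites, ordered by name.
    Returns (results, order), where `order` is the completion order.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    results, order, running = {}, [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Dispatch every task whose prerequisites have all completed.
            ready = [name for name, pending in remaining.items() if not pending]
            for name in ready:
                del remaining[name]
                args = [results[p] for p in sorted(deps.get(name, ()))]
                running[pool.submit(tasks[name], *args)] = name
            if not running:
                raise ValueError("dependency cycle detected")
            # Block until at least one in-flight task finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                order.append(name)
                for pending in remaining.values():
                    pending.discard(name)
    return results, order


# Hypothetical four-task application: "scale" and "select" are independent
# of each other, so the sketch may run them concurrently.
def load():
    return list(range(8))


def scale(samples):
    return [2 * x for x in samples]


def select(samples):
    return [x for x in samples if x % 2 == 0]


def combine(scaled, selected):
    return sum(scaled) + sum(selected)


TASKS = {"load": load, "scale": scale, "select": select, "combine": combine}
DEPS = {"scale": {"load"}, "select": {"load"}, "combine": {"scale", "select"}}

results, order = run_task_graph(TASKS, DEPS)
print(results["combine"], order)
```

The sketch only mirrors the scheduling structure: a task is dispatched as soon as everything it depends on has finished. It says nothing about how tasks are discovered, how they are mapped to accelerators, or what speedup to expect, and Python threads are used purely for brevity.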


Published In

ACM Transactions on Embedded Computing Systems, Volume 24, Issue 1
January 2025
664 pages
EISSN: 1558-3465
DOI: 10.1145/3696805
  • Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 December 2024
Online AM: 15 November 2024
Accepted: 06 November 2024
Revised: 26 September 2024
Received: 22 May 2024
Published in TECS Volume 24, Issue 1

Author Tags

  1. Coarse-grained task parallelization
  2. dynamic profiling
  3. heterogeneous SoC and runtime
  4. parallelism detection

Qualifiers

  • Research-article

Funding Sources

  • Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA)

