CGMBE: a model-based tool for the design and implementation of real-time image processing applications on CPU–GPU platforms

Wu, Jiahao; Xie, Jing; Bardakoff, Alexandre; Blattner, Timothy; Keyrouz, Walid; Bhattacharyya, Shuvra S.

doi:10.1007/s11554-020-00994-9

Original Research Paper
Published: 07 July 2020

Volume 18, pages 561–583, (2021)
Journal of Real-Time Image Processing

Jiahao Wu ORCID: orcid.org/0000-0003-4786-2559¹,
Jing Xie¹,
Alexandre Bardakoff²,
Timothy Blattner²,
Walid Keyrouz² &
Shuvra S. Bhattacharyya¹

Abstract

Processing large images in real time requires effective image processing algorithms as well as efficient software design and implementation to take full advantage of all CPU cores and GPU resources on state-of-the-art CPU/GPU platforms. Efficiently coordinating computations on both the host (CPU) and devices (GPUs), along with host–device data transfers, is critical to achieving real-time performance. However, such coordination is challenging for system designers given the complexity of modern image processing applications and the targeted processing platforms. In this paper, we present a novel model-based design tool that automates and optimizes these critical design decisions for real-time image processing implementation. The proposed tool consists of a compile-time static analyzer and a run-time dynamic scheduler. The tool automates the process of scheduling dataflow tasks (actors) and coordinating CPU–GPU data transfers in an integrated manner. The approach uses an unfolded dataflow graph representation of the application along with thread-pool-based executors, which are optimized for efficient operation on the targeted CPU–GPU platform. This approach automates the most complicated aspects of the design and implementation process for image processing system designers, while maximizing the utilization of computational power, reducing the memory footprint for both the CPU and GPU, and facilitating experimentation for tuning performance-oriented designs.
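
As a rough illustration of the run-time side of the approach described in the abstract, the sketch below fires the actors of a small acyclic dataflow graph on a thread pool as soon as their inputs are available. It is not CGMBE's implementation: the function `run_dataflow`, the dictionary-based graph encoding, and the toy actors are all hypothetical names chosen for this example, and the compile-time analysis, graph unfolding, and CPU–GPU data transfers are omitted.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_dataflow(actors, deps, num_workers=4):
    """Fire each actor once, as soon as all of its predecessors have finished.

    actors: dict mapping actor name -> callable; the callable receives the
            results of its predecessors, in the order listed in deps.
    deps:   dict mapping actor name -> list of predecessor names.
    (Hypothetical interface for illustration only; the graph must be acyclic.)
    """
    results = {}            # actor name -> produced value
    pending = set(actors)   # actors not yet submitted
    running = {}            # future -> actor name
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while pending or running:
            # Submit every pending actor whose inputs are all available.
            ready = [a for a in pending
                     if all(p in results for p in deps.get(a, []))]
            for name in ready:
                pending.remove(name)
                args = [results[p] for p in deps.get(name, [])]
                running[pool.submit(actors[name], *args)] = name
            if not running:
                # Nothing is executing and nothing can start.
                raise ValueError("graph has a cycle or a missing predecessor")
            # Block until at least one running actor completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                results[running.pop(fut)] = fut.result()
    return results


# Toy four-actor graph: load -> {scale, offset} -> merge.
actors = {
    "load": lambda: [1, 2, 3, 4],
    "scale": lambda xs: [2 * x for x in xs],
    "offset": lambda xs: [x + 1 for x in xs],
    "merge": lambda a, b: [x + y for x, y in zip(a, b)],
}
deps = {"scale": ["load"], "offset": ["load"], "merge": ["scale", "offset"]}
out = run_dataflow(actors, deps)
print(out["merge"])
```

In this toy graph, `scale` and `offset` become ready at the same time and may run concurrently on different worker threads, while `merge` waits for both. A scheduler for a real CPU–GPU platform would additionally have to decide where each actor runs and when data moves between host and device, which this sketch does not attempt.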

Abbreviations

API:: Application programming interface
CGMBE:: CPU–GPU model-based engine
DIF:: Dataflow interchange format
GSLD:: Global static local dynamic
HEFT:: Heterogeneous earliest finish time
HMBE:: HTGS model-based engine
HTGS:: Hybrid task graph scheduler
SDF:: Synchronous dataflow
SNR:: Signal–noise ratio
SRSDF:: Single-rate synchronous dataflow
SVD:: Singular value decomposition
VCA:: Vertex component analysis

Author information

Authors and Affiliations

University of Maryland, College Park, MD, USA
Jiahao Wu, Jing Xie & Shuvra S. Bhattacharyya
National Institute of Standards and Technology, Gaithersburg, MD, USA
Alexandre Bardakoff, Timothy Blattner & Walid Keyrouz

Corresponding author

Correspondence to Jiahao Wu.

About this article

Cite this article

Wu, J., Xie, J., Bardakoff, A. et al. CGMBE: a model-based tool for the design and implementation of real-time image processing applications on CPU–GPU platforms. J Real-Time Image Proc 18, 561–583 (2021). https://doi.org/10.1007/s11554-020-00994-9

Received: 13 September 2019
Accepted: 22 June 2020
Published: 07 July 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s11554-020-00994-9
