CGMBE: a model-based tool for the design and implementation of real-time image processing applications on CPU–GPU platforms

  • Original Research Paper
Journal of Real-Time Image Processing

Abstract

Processing large images in real time requires effective image processing algorithms as well as efficient software design and implementation that take full advantage of all CPU cores and GPU resources on state-of-the-art CPU–GPU platforms. Efficiently coordinating computations on both the host (CPU) and devices (GPUs), along with host–device data transfers, is critical to achieving real-time performance. However, such coordination is challenging for system designers, given the complexity of both modern image processing applications and the targeted processing platforms. In this paper, we present a novel model-based design tool that automates and optimizes these critical design decisions for real-time image processing implementation. The proposed tool consists of a compile-time static analyzer and a run-time dynamic scheduler. The tool automates the process of scheduling dataflow tasks (actors) and coordinating CPU–GPU data transfers in an integrated manner. The approach uses an unfolded dataflow graph representation of the application along with thread-pool-based executors, which are optimized for efficient operation on the targeted CPU–GPU platform. This approach automates the most complicated aspects of the design and implementation process for image processing system designers, while maximizing the utilization of computational power, reducing the memory footprint on both the CPU and GPU, and facilitating experimentation for tuning performance-oriented designs.
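To make the idea of an unfolded dataflow graph driven by a thread pool more concrete, the following is a minimal Python sketch. It is not the CGMBE implementation. It assumes stateless actors, a single-rate graph with no inter-iteration edges, and CPU-only execution, and it ignores GPU mapping and host–device transfers entirely. The names `unfold`, `run_unfolded`, and the three-actor pipeline are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def unfold(deps, copies):
    """Unfold a single-rate graph: one node (actor, i) per actor and iteration i.

    Assumption: no edges between iterations, so each copy is independent.
    """
    return {
        (name, i): [(pred, i) for pred in preds]
        for i in range(copies)
        for name, preds in deps.items()
    }


def run_unfolded(actors, deps, copies, num_workers=4):
    """Fire each unfolded node on a thread pool once all its inputs exist."""
    graph = unfold(deps, copies)
    results = {}          # node -> output token
    pending = {}          # future -> node currently executing
    waiting = set(graph)  # nodes not yet submitted
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while waiting or pending:
            # A node is ready when every predecessor has produced its result.
            ready = [n for n in waiting if all(p in results for p in graph[n])]
            for node in ready:
                name, i = node
                inputs = [results[p] for p in graph[node]]
                pending[pool.submit(actors[name], i, inputs)] = node
                waiting.discard(node)
            if not pending:
                raise ValueError("graph has a cycle or an undefined predecessor")
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results


# Hypothetical three-actor pipeline: load -> blur -> total.
# Each actor receives the iteration index and a list of predecessor outputs.
actors = {
    "load": lambda i, inputs: [i, i + 1, i + 2],
    "blur": lambda i, inputs: [x * 0.5 for x in inputs[0]],
    "total": lambda i, inputs: sum(inputs[0]),
}
deps = {"load": [], "blur": ["load"], "total": ["blur"]}

out = run_unfolded(actors, deps, copies=3)
print(out[("total", 0)], out[("total", 1)], out[("total", 2)])
```

Because the three unfolded copies share no edges in this sketch, actors from different iterations can run concurrently on the pool, which is the kind of overlap that unfolding is meant to expose. A real tool would additionally have to handle actor state, delays on edges, and device placement.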

Abbreviations

API: Application programming interface

CGMBE: CPU–GPU model-based engine

DIF: Dataflow interchange format

GSLD: Global static local dynamic

HEFT: Heterogeneous earliest finish time

HMBE: HTGS model-based engine

HTGS: Hybrid task graph scheduler

SDF: Synchronous dataflow

SNR: Signal-to-noise ratio

SRSDF: Single-rate synchronous dataflow

SVD: Singular value decomposition

VCA: Vertex component analysis

Author information

Corresponding author

Correspondence to Jiahao Wu.

About this article

Cite this article

Wu, J., Xie, J., Bardakoff, A. et al. CGMBE: a model-based tool for the design and implementation of real-time image processing applications on CPU–GPU platforms. J Real-Time Image Proc 18, 561–583 (2021). https://doi.org/10.1007/s11554-020-00994-9

  • DOI: https://doi.org/10.1007/s11554-020-00994-9
