ABSTRACT
Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency, to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput, to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs have great potential to speed up these computations. However, their use for hard real-time stream processing is complicated by slow communication with CPUs, throughput that varies non-linearly with the input size, and weak consistency of their local memory with respect to CPU accesses. Furthermore, their coarse-grained hardware scheduler renders them unsuitable for unbalanced multi-stream workloads.
We present a general, efficient, and practical algorithm for hard real-time stream scheduling in heterogeneous systems. The algorithm assigns incoming streams of different rates and deadlines to CPUs and accelerators. By employing novel stream schedulability criteria for accelerators, the algorithm finds the assignment that simultaneously satisfies the aggregate throughput requirements of all the streams and the deadline constraint of each individual stream.
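To make the two requirements concrete, the following is a minimal illustrative sketch and not the algorithm or the schedulability criteria of the paper. It assumes a deliberately simple model: one CPU and one accelerator, each with a fixed throughput in bytes per second; an accelerator that processes all of its streams in periodic batches with a fixed per-batch overhead; and a CPU that can meet any deadline as long as its throughput is not exceeded. All names (`Stream`, `cpu_feasible`, `accel_feasible`, `assign`) and parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    rate: float      # bytes per second (assumed unit)
    deadline: float  # seconds allowed between data arrival and completion


def cpu_feasible(streams, cpu_throughput):
    """Aggregate-throughput check only (simplifying assumption for the CPU)."""
    return sum(s.rate for s in streams) <= cpu_throughput


def accel_feasible(streams, accel_throughput, batch_overhead):
    """Sufficient check for batched processing under the assumed model.

    Data is collected for a period T = d / 2, where d is the shortest
    deadline, and then processed as one batch. If the batch takes at most T,
    every byte is finished within 2 * T = d of its arrival.
    """
    if not streams:
        return True
    total_rate = sum(s.rate for s in streams)
    if total_rate >= accel_throughput:
        return False  # aggregate throughput requirement violated
    period = min(s.deadline for s in streams) / 2
    batch_time = batch_overhead + period * total_rate / accel_throughput
    return batch_time <= period  # per-stream deadline requirement


def assign(streams, cpu_throughput, accel_throughput, batch_overhead):
    """Greedy split: move the shortest-deadline streams to the CPU until the
    remaining accelerator set passes its check. Returns (cpu, accel) lists,
    or None if this heuristic finds no feasible split."""
    ordered = sorted(streams, key=lambda s: s.deadline)
    for k in range(len(ordered) + 1):
        cpu, accel = ordered[:k], ordered[k:]
        if not cpu_feasible(cpu, cpu_throughput):
            return None  # CPU load only grows as k increases
        if accel_feasible(accel, accel_throughput, batch_overhead):
            return cpu, accel
    return None
```

In this toy model, a single stream with a very short deadline forces a short batch period on the accelerator, so the fixed per-batch overhead dominates; moving that stream to the CPU lets the remaining streams be batched over a longer period.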
Using the AES-CBC encryption kernel, we experimented extensively on thousands of streams with realistic rate and deadline distributions. Our framework outperformed the alternative methods by allowing 50% more streams to be processed with provably deadline-compliant execution, even for deadlines as short as tens of milliseconds. Overall, the combined GPU-CPU execution allows for up to a 4-fold throughput increase over highly optimized multi-threaded CPU-only implementations.
Index Terms
- Processing data streams with hard real-time constraints on heterogeneous systems