Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

Wang, Yan; Li, Kenli; Li, Keqin

doi:10.1007/s10766-016-0445-2

Published: 15 July 2016

International Journal of Parallel Programming, volume 45, pages 827–852 (2017)

Yan Wang^1,2, Kenli Li^2 & Keqin Li^2

Abstract

This paper addresses the scheduling problem for multi-dimensional loops applications on heterogeneous multicore processors. In this problem, a key issue is how to hide memory latency in order to reduce the schedule length. As CPU speeds increase, the gap between processor and memory performance has become an important bottleneck for modern high-performance computer systems. To address this bottleneck, a variety of latency-hiding techniques have been studied, ranging from intermediate fast memories (caches) to various prefetching and memory management techniques. Although the literature offers many algorithms for scheduling with memory management on multiprocessor systems, they may not deliver high-quality, high-performance schedules on heterogeneous multicore processors. In this paper, we first propose a scheduling algorithm, Recom_Task_Assign, to reduce write activities to main memory. Then, in conjunction with Recom_Task_Assign, we present a new partition scheduling algorithm called heterogeneous multiprocessor partition (HMP), which is based on prefetching and hides memory latencies for applications with multi-dimensional loops on heterogeneous multicore processors. This technique takes advantage of memory access pattern information and fully considers the heterogeneity of processors to achieve high processor utilization. The HMP algorithm selects the partition size and shape appropriate to each processor, which increases processor utilization and reduces memory latency. Experiments on DSP benchmarks show that our algorithm efficiently reduces memory latency and enhances parallelism compared with existing methods.
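
The idea of sizing partitions according to processor capability can be illustrated with a minimal sketch. This is not the HMP algorithm, whose details are not given in the abstract; it only shows, under stated assumptions, how a two-dimensional iteration space might be cut into row bands whose heights are proportional to processor speed. The function names `partition_rows` and `band_time`, the integer relative speeds, and the unit cost per iteration are all hypothetical, and prefetching is not modeled.

```python
from fractions import Fraction


def partition_rows(num_rows, speeds):
    """Split rows [0, num_rows) into contiguous bands, one per processor.

    Band heights are proportional to the (assumed, integer) relative
    speeds; leftover rows go to the largest fractional remainders.
    """
    total = sum(speeds)
    exact = [Fraction(num_rows * s, total) for s in speeds]
    heights = [int(e) for e in exact]  # floor of each exact share
    leftover = num_rows - sum(heights)
    # Processors ordered by how much of a row they were short-changed.
    order = sorted(range(len(speeds)),
                   key=lambda p: exact[p] - heights[p],
                   reverse=True)
    for p in order[:leftover]:
        heights[p] += 1
    bands, start = [], 0
    for h in heights:
        bands.append((start, start + h))
        start += h
    return bands


def band_time(band, num_cols, speed):
    """Time to execute one band, assuming unit work per iteration."""
    rows = band[1] - band[0]
    return Fraction(rows * num_cols, speed)


if __name__ == "__main__":
    speeds = [3, 2, 1]  # hypothetical relative processor speeds
    bands = partition_rows(12, speeds)
    for band, speed in zip(bands, speeds):
        print(band, band_time(band, 8, speed))
```

With speeds 3, 2, and 1 and a 12 by 8 iteration space, each band takes the same time under this simple cost model, which is the sense in which speed-proportional partition sizes keep all processors busy.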

Acknowledgments

This research was partially funded by the Key Program of National Natural Science Foundation of China (61133005, 61432005), the National Science Foundation of China (Grant Nos. 61070057, 90715029, 61370095, 61472124), and the National Science Foundation for Distinguished Young Scholars of Hunan (12JJ1011).

Author information

Authors and Affiliations

School of Computer Science and Educational Software, Guangzhou University, Guangzhou, China
Yan Wang
College of Information Science and Engineering, Hunan University, Changsha, 410082, China
Yan Wang, Kenli Li & Keqin Li

Corresponding author

Correspondence to Yan Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and Animal Rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Wang, Y., Li, K. & Li, K. Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications. Int J Parallel Prog 45, 827–852 (2017). https://doi.org/10.1007/s10766-016-0445-2

Received: 10 November 2015
Accepted: 07 July 2016
Published: 15 July 2016
Issue Date: August 2017
DOI: https://doi.org/10.1007/s10766-016-0445-2
