Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

International Journal of Parallel Programming

Abstract

This paper addresses the scheduling problem for applications with multi-dimensional loops on heterogeneous multicore processors. A central issue in scheduling multi-dimensional loops is how to hide memory latency so as to reduce the schedule length. As CPU speeds increase, the gap between processor and memory performance has become a major bottleneck for modern high-performance computer systems. To address this bottleneck, a variety of techniques for hiding memory latency have been studied, ranging from intermediate fast memories (caches) to various prefetching and memory management techniques. Although the literature contains many algorithms for scheduling with memory management on multiprocessor systems, they may not produce high-quality, high-performance schedules for heterogeneous multicore processors. In this paper, we first propose a scheduling algorithm, Recom_Task_Assign, that reduces write activities to main memory. Then, in conjunction with Recom_Task_Assign, we present a new partition scheduling algorithm for heterogeneous multicore processors, called heterogeneous multiprocessor partition (HMP), which is based on prefetching and can hide memory latencies for applications with multi-dimensional loops. The technique exploits memory access pattern information and fully accounts for the heterogeneity of the processors to achieve high processor utilization. The HMP algorithm selects an appropriate partition size and shape for each processor, which increases processor utilization and reduces memory latency. Experiments on DSP benchmarks show that our algorithm efficiently reduces memory latency and enhances parallelism compared with existing methods.
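The abstract does not give the details of HMP, so the following is only a minimal sketch of the general idea that a partition should be large enough for its computation to cover the prefetch of the next partition's data. It is not the authors' algorithm. The `Core` class, the `cycles_per_iteration` cost, the fixed `PREFETCH_LATENCY`, and both helper functions are hypothetical names and modelling assumptions introduced purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Hypothetical description of one core in a heterogeneous processor."""
    name: str
    cycles_per_iteration: int  # assumed cost of one loop iteration on this core


def min_partition_size(core: Core, prefetch_latency: int) -> int:
    """Smallest number of iterations whose compute time covers the prefetch.

    Assumed model: the data for the next partition arrives after a fixed
    `prefetch_latency` (in cycles), and the prefetch overlaps with the
    execution of the current partition.
    """
    return math.ceil(prefetch_latency / core.cycles_per_iteration)


def exposed_stall(core: Core, partition_size: int, prefetch_latency: int) -> int:
    """Cycles the core would wait for memory under the same assumed model."""
    compute_time = partition_size * core.cycles_per_iteration
    return max(0, prefetch_latency - compute_time)


# Illustrative numbers only; they are not taken from the paper.
PREFETCH_LATENCY = 100
cores = [Core("big", 2), Core("little", 5)]

sizes = {c.name: min_partition_size(c, PREFETCH_LATENCY) for c in cores}
for c in cores:
    stall = exposed_stall(c, sizes[c.name], PREFETCH_LATENCY)
    print(f"{c.name}: partition size {sizes[c.name]}, exposed stall {stall}")
```

Under this toy model a faster core finishes each iteration sooner, so it needs more iterations per partition to keep the memory latency hidden. Giving every core the same partition size would leave the faster core stalled, which is one way to see why a per-processor partition size matters on heterogeneous hardware.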

Acknowledgments

This research was partially funded by the Key Program of National Natural Science Foundation of China (61133005, 61432005), the National Science Foundation of China (Grant Nos. 61070057, 90715029, 61370095, 61472124), and the National Science Foundation for Distinguished Young Scholars of Hunan (12JJ1011).

Author information

Corresponding author

Correspondence to Yan Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and Animal Rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Wang, Y., Li, K. & Li, K. Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications. Int J Parallel Prog 45, 827–852 (2017). https://doi.org/10.1007/s10766-016-0445-2

  • DOI: https://doi.org/10.1007/s10766-016-0445-2
