Abstract
This paper addresses the scheduling problem for multi-dimensional loop applications on heterogeneous multicore processors. In multi-dimensional loop scheduling, a significant issue is how to hide memory latency in order to reduce the schedule length. As CPU speeds increase, the gap between processor and memory performance has become an important bottleneck for modern high-performance computer systems. To address this bottleneck, a variety of latency-hiding techniques have been studied, ranging from intermediate fast memories (caches) to various prefetching and memory management techniques. Although the literature contains many algorithms for scheduling with memory management on multiprocessor systems, they may not produce high-quality, high-performance schedules on heterogeneous multicore processors. In this paper, we first propose a scheduling algorithm, Recom_Task_Assign, to reduce write activities to main memory. Then, in conjunction with Recom_Task_Assign, we present a new partition scheduling algorithm, heterogeneous multiprocessor partition (HMP), which is based on prefetching and hides memory latencies for applications with multi-dimensional loops on heterogeneous multicore processors. The technique exploits memory access pattern information and fully considers processor heterogeneity to achieve high processor utilization. HMP selects an appropriate partition size and shape for each processor, which increases processor utilization and reduces memory latency. Experiments on DSP benchmarks show that our algorithm efficiently reduces memory latency and enhances parallelism compared with existing methods.
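As a rough illustration of why prefetching shortens the schedule and why the partition size should depend on the processor, the following Python sketch uses a deliberately simplified cost model. It is not the paper's HMP or Recom_Task_Assign algorithm: the function names, the fixed per-partition fetch latency, and the cycle counts are all assumptions made for illustration.

```python
def serial_time(num_parts, compute, fetch):
    """Schedule length when each partition waits for its data, then computes."""
    return num_parts * (fetch + compute)


def prefetch_time(num_parts, compute, fetch):
    """Schedule length when data for partition k+1 is fetched while k computes.

    Assumed model: the first fetch cannot be hidden, each overlapped step
    costs the longer of the two activities, and the last partition only
    computes.
    """
    return fetch + (num_parts - 1) * max(compute, fetch) + compute


def min_partition_size(fetch_latency, cycles_per_iteration):
    """Smallest iteration count whose compute time covers the fetch latency.

    Assumes (hypothetically) that the fetch latency does not grow with the
    partition size. Integer ceiling division avoids floating-point rounding.
    """
    return (fetch_latency + cycles_per_iteration - 1) // cycles_per_iteration


if __name__ == "__main__":
    # Four partitions, 10 cycles of computation and 6 cycles of fetch each.
    print(serial_time(4, 10, 6), prefetch_time(4, 10, 6))
    # A fast core (2 cycles/iteration) needs a larger partition than a
    # slow core (4 cycles/iteration) to hide the same 12-cycle latency.
    print(min_partition_size(12, 2), min_partition_size(12, 4))
```

Under this model, a faster core finishes each iteration sooner and therefore needs more iterations per partition to keep the memory transfer hidden, which is one intuition for choosing partition sizes per processor on a heterogeneous system.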
Acknowledgments
This research was partially funded by the Key Program of National Natural Science Foundation of China (61133005, 61432005), the National Science Foundation of China (Grant Nos. 61070057, 90715029, 61370095, 61472124), and the National Science Foundation for Distinguished Young Scholars of Hunan (12JJ1011).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and Animal Rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
Cite this article
Wang, Y., Li, K. & Li, K. Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications. Int J Parallel Prog 45, 827–852 (2017). https://doi.org/10.1007/s10766-016-0445-2
DOI: https://doi.org/10.1007/s10766-016-0445-2