Abstract
Co-runs of independent applications on systems with heterogeneous processors are common (data centers, mobile devices, etc.). There has been limited understanding of the influence of co-runners on such systems. Previous studies on this topic were conducted on simulators with limited settings.
In this work, we conduct a comprehensive investigation of the performance of co-running jobs on integrated heterogeneous processors. The investigation produces a list of interesting and counter-intuitive findings. It reveals some critical design issues in modern operating systems in supporting heterogeneous processors, and suggests some potential solutions at the levels of program transformation and OS design.
Acknowledgments
We thank the reviewers for their helpful comments. This material is based upon work supported by a DOE Early Career Award and by the National Science Foundation (NSF) under Grant No. 1320796 and a CAREER Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE or NSF. This work is also partially supported by the 863 Program of China (2012AA010905), NSFC (61272144, 61272143) and NUDT/Hunan Innov. Fund. for PostGrad. (B120604, CX2012B029).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, Q., Wu, B., Shen, X., Shen, L., Wang, Z. (2015). Understanding Co-run Degradations on Integrated Heterogeneous Processors. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_6
DOI: https://doi.org/10.1007/978-3-319-17473-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science, Computer Science (R0)