Understanding Co-run Degradations on Integrated Heterogeneous Processors

Zhu, Qi; Wu, Bo; Shen, Xipeng; Shen, Li; Wang, Zhiying

doi:10.1007/978-3-319-17473-0_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8967)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Co-runs of independent applications on systems with heterogeneous processors are common (data centers, mobile devices, etc.). There has been limited understanding of the influence of co-runners on such systems. Previous studies on this topic were conducted on simulators with limited settings.

In this work, we conduct a comprehensive investigation of the performance of co-running jobs on integrated heterogeneous processors. The investigation produces a list of interesting and counter-intuitive findings. It reveals some critical design issues in modern operating systems in supporting heterogeneous processors, and suggests some potential solutions at the levels of program transformation and OS design.

Acknowledgments

We thank the reviewers for the helpful comments. This material is based upon work supported by a DOE Early Career Award and by the National Science Foundation (NSF) under Grant No. 1320796 and a CAREER Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE or NSF. This work is also partially supported by the 863 Program of China (2012AA010905), NSFC (61272144, 61272143) and NUDT/Hunan Innov. Fund. For PostGrad. (B120604, CX2012B029).

Author information

Authors and Affiliations

National Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha Hunan, 410073, China
Qi Zhu, Li Shen & Zhiying Wang
EECS, Colorado School of Mines, Golden, CO, 80401, USA
Bo Wu
Department of Computer Science, North Carolina State University, Raleigh, NC, 27695, USA
Xipeng Shen

Corresponding author

Correspondence to Qi Zhu.

Editor information

Editors and Affiliations

Intel Corporation, Santa Clara, California, USA
James Brodman
Intel Corporation, Santa Clara, California, USA
Peng Tu

About this paper

Cite this paper

Zhu, Q., Wu, B., Shen, X., Shen, L., Wang, Z. (2015). Understanding Co-run Degradations on Integrated Heterogeneous Processors. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_6

DOI: https://doi.org/10.1007/978-3-319-17473-0_6
Published: 01 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science, Computer Science (R0)
