
Transparently Space Sharing a Multicore Among Multiple Processes

Published: 07 November 2016

Abstract

As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide neither an interface nor a strategy for intelligently allocating hardware threads, and do not even prevent oversubscription. Prior research methods either depend on profiling applications ahead of time to make good allocation decisions or do not account for process efficiency at all, leading to poor performance. None of these prior methods has been adopted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation.
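To make malleability concrete, here is a minimal sketch of our own (not code from the article): a typical OpenMP program already tolerates a different thread count at each parallel region, and that region boundary is the natural point at which a runtime such as SCAF can resize a process's allocation without changing program semantics. Because the team size is resolved by the runtime library at each region entry, a drop-in replacement runtime can make this decision on the application's behalf, which is consistent with the article's claim that no recompilation is required.

    /* Minimal illustration (ours): the thread count of each OpenMP
     * parallel region may differ, so an allocator can resize a process
     * between regions. Here we vary the count by hand; SCAF instead
     * chooses it from observed system-wide efficiency.
     * Build with, e.g., gcc -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        for (int nthreads = 1; nthreads <= 4; nthreads *= 2) {
            omp_set_num_threads(nthreads);  /* request for the next region */
            #pragma omp parallel
            {
                #pragma omp single
                printf("region executed with %d threads\n",
                       omp_get_num_threads());
            }
        }
        return 0;
    }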

In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions.
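The abstract does not spell out the estimator, so the following is a hedged sketch of the general idea under our own assumptions, not the article's exact method: each thread counts its retired instructions with PAPI's PAPI_TOT_INS event, and efficiency on N threads is approximated as the parallel instruction rate divided by N times the serial rate. The workload do_work() and the thread counts are hypothetical, and this naive metric over-credits threads that busy-wait. Build with, e.g., gcc -fopenmp ... -lpapi.

    #include <omp.h>
    #include <papi.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical workload standing in for one unit of application work. */
    static void do_work(void) {
        volatile double x = 0.0;
        for (long i = 1; i <= 10000000L; i++)
            x += 1.0 / (double)i;
    }

    /* Instructions retired per second when each thread runs do_work() once. */
    static double work_rate(int nthreads) {
        long long total_ins = 0;
        omp_set_num_threads(nthreads);
        double t0 = omp_get_wtime();
        #pragma omp parallel reduction(+ : total_ins)
        {
            int evset = PAPI_NULL;
            long long ins = 0;
            PAPI_register_thread();
            PAPI_create_eventset(&evset);
            PAPI_add_event(evset, PAPI_TOT_INS);
            PAPI_start(evset);
            do_work();
            PAPI_stop(evset, &ins);
            PAPI_cleanup_eventset(evset);
            PAPI_destroy_eventset(&evset);
            total_ins += ins;
        }
        return (double)total_ins / (omp_get_wtime() - t0);
    }

    int main(void) {
        PAPI_library_init(PAPI_VER_CURRENT);
        /* Common Linux idiom for per-thread counting with PAPI. */
        PAPI_thread_init((unsigned long (*)(void))pthread_self);
        double serial   = work_rate(1);
        double parallel = work_rate(4);
        /* efficiency(N) ~= rate(N) / (N * rate(1)); values near 1.0 mean
         * the process is making good use of its extra threads. */
        printf("estimated efficiency on 4 threads: %.2f\n",
               parallel / (4.0 * serial));
        return 0;
    }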

We evaluated SCAF using the NAS Parallel Benchmarks (NPB) on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of improvement in sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04x to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average.
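For reference, the sum-of-speedups metric for P co-scheduled processes can be written as follows (one common formulation, in our notation; the article may normalize differently):

    \text{sum of speedups} = \sum_{i=1}^{P} \frac{T_{\mathrm{ref}}(i)}{T_{\mathrm{shared}}(i)}

where T_ref(i) is a fixed reference runtime for benchmark i (e.g., its runtime when given the whole machine) and T_shared(i) is its runtime while co-scheduled under the scheme being evaluated; here P = 2. A factor of 1.09x thus means that the benchmark pairs, taken together, retained about 9% more of their reference performance under SCAF than under equipartitioning.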

Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming with unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes on all five machines, with a mean improvement of 1.08x to 2.07x, depending on the machine, and 1.59x on average.



  • Published in

    ACM Transactions on Parallel Computing, Volume 3, Issue 3
    December 2016, 145 pages
    ISSN: 2329-4949
    EISSN: 2329-4957
    DOI: 10.1145/3012407

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 7 November 2016
    • Accepted: 1 September 2016
    • Revised: 1 January 2016
    • Received: 1 October 2014
