DOI: 10.1145/1693453.1693480
research-article

Thread to strand binding of parallel network applications in massive multi-threaded systems

Published: 09 January 2010

ABSTRACT

In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding or TSB. In this paper, we show that the TSB impact on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing.
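To make the notion of a binding concrete, the following minimal sketch classifies which level of resource sharing two threads experience under a given thread-to-strand binding. It is an illustration only, not the paper's TSBSched algorithm. It assumes the commonly documented UltraSPARC T2 layout of 8 cores, each with 2 pipelines of 4 strands (64 strands in total), and it assumes strands are numbered linearly by core and then by pipeline. The level labels, function names, and thread names (`rx`, `proc`, `tx`) are hypothetical.

```python
from itertools import combinations

# Assumed T2-like topology: 8 cores x 2 pipelines x 4 strands = 64 strands.
CORES = 8
PIPES_PER_CORE = 2
STRANDS_PER_PIPE = 4
STRANDS_PER_CORE = PIPES_PER_CORE * STRANDS_PER_PIPE


def locate(strand):
    """Map a linear strand id to (core, pipeline) under the assumed numbering."""
    if not 0 <= strand < CORES * STRANDS_PER_CORE:
        raise ValueError(f"strand id out of range: {strand}")
    core = strand // STRANDS_PER_CORE
    pipe = (strand % STRANDS_PER_CORE) // STRANDS_PER_PIPE
    return core, pipe


def sharing_level(a, b):
    """Return the closest level at which strands a and b share resources."""
    core_a, pipe_a = locate(a)
    core_b, pipe_b = locate(b)
    if core_a != core_b:
        return "inter-core"   # different cores: only chip-level resources shared
    if pipe_a != pipe_b:
        return "intra-core"   # same core, different pipelines
    return "intra-pipe"       # same pipeline: the tightest sharing


def describe_binding(binding):
    """binding maps thread name -> strand id; returns sharing level per thread pair."""
    return {
        (t1, t2): sharing_level(binding[t1], binding[t2])
        for t1, t2 in combinations(sorted(binding), 2)
    }
```

For example, `describe_binding({"rx": 0, "proc": 5, "tx": 9})` places `rx` and `proc` on the same core but different pipelines, and `tx` on a different core. Two bindings of the same three threads can therefore expose them to different sharing levels, which is the degree of freedom the TSB step controls.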

We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly simplifies the problem of thread to strand binding for software-pipelined applications, which are representative of multithreaded network applications. Our systematic approach captures both the characteristics of the multithreaded processors under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the part of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.

    • Published in

      PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      January 2010
      372 pages
      ISBN: 9781605588773
      DOI: 10.1145/1693453
      • ACM SIGPLAN Notices, Volume 45, Issue 5
        PPoPP '10
        May 2010
        346 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/1837853

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 230 of 1,014 submissions, 23%
