research-article

SLITS: Sparsity-Lightened Intelligent Thread Scheduling

Published: 02 March 2023

Abstract

A diverse set of scheduling objectives (e.g., resource contention, fairness, and priority) has given rise to a series of objective-specific schedulers for multi-core architectures. Existing designs collect thread-to-thread statistics at runtime and schedule threads based on this abstraction, which we formalize as the Thread-Interaction Matrix (TIM). However, this abstraction also exposes a consistently overlooked issue: the TIM is highly sparse. As a result, existing designs can only make sub-optimal decisions, because the sparsity limits the number of thread permutations (and their statistics) that can be exploited when making scheduling decisions.
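To make the sparsity argument concrete, here is a minimal Python sketch, assuming a TIM stored as a dictionary keyed by unordered thread pairs. The function names and the meaning of the interaction score are illustrative assumptions, not part of SLITS. With N threads there are N(N-1)/2 distinct pairs, so even a modest thread count produces many entries that are never profiled.

```python
# Hypothetical sketch: a Thread-Interaction Matrix (TIM) stored sparsely.
# Keys are unordered thread pairs (i, j) with i < j; values are a measured
# interaction score (assumed here to be something like a slowdown ratio when
# i and j share a resource). Unprofiled pairs are simply absent.


def build_sparse_tim(observed):
    """Normalize observed {(i, j): score} samples into canonical (min, max) keys."""
    tim = {}
    for (i, j), score in observed.items():
        if i == j:
            continue  # a thread does not interact with itself
        tim[(min(i, j), max(i, j))] = score
    return tim


def tim_sparsity(tim, num_threads):
    """Fraction of the N*(N-1)/2 distinct thread pairs with no measurement."""
    total_pairs = num_threads * (num_threads - 1) // 2
    return 1.0 - len(tim) / total_pairs


if __name__ == "__main__":
    n = 32
    # Suppose each thread was only ever co-scheduled with its neighbour i + 1.
    samples = {(i, i + 1): 1.0 + 0.01 * i for i in range(n - 1)}
    tim = build_sparse_tim(samples)
    print(len(tim), n * (n - 1) // 2, tim_sparsity(tim, n))
```

With 32 threads and only neighbouring pairs profiled, 31 of the 496 pairs are measured, so the script prints `31 496 0.9375`: almost 94% of the matrix is empty.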
We introduce Sparsity-Lightened Intelligent Thread Scheduling (SLITS), a general scheduler design that mitigates the sparsity of the TIM and can be customized for different scheduling objectives. SLITS builds on the key insight that the sparsity of the TIM can be effectively mitigated with advanced Machine Learning (ML) techniques. SLITS has three components. First, SLITS profiles thread interactions for only a small number of thread permutations and forms the TIM from the run-time statistics. Second, SLITS estimates the missing values in the TIM using a Factorization Machine (FM), an ML technique that can fill in the missing values of a large, sparse matrix from limited information. Third, SLITS uses Lazy Reschedule, a general mechanism that serves as the building block for customizing scheduling policies to different scheduling objectives. We show how SLITS can be (1) customized for different scheduling objectives, including resource contention and fairness, and (2) implemented with negligible hardware cost. We also discuss how SLITS could be applied to other thread-scheduling contexts.
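The sketch below illustrates, under stated assumptions, how a degree-2 Factorization Machine in Rendle's formulation can estimate missing TIM entries. The feature encoding is an assumption: each entry (i, j) is treated as a two-hot vector over thread IDs, under which the general FM prediction w0 + sum_k w_k x_k + sum_{k<l} <v_k, v_l> x_k x_l reduces to w0 + w_i + w_j + <v_i, v_j>. The class `PairFM`, its hyperparameters, and the plain SGD loop are hypothetical and are not the SLITS implementation.

```python
import random


class PairFM:
    """Hypothetical degree-2 Factorization Machine restricted to two-hot inputs.

    Assumes each TIM entry (i, j), i < j, is encoded with a 1 at thread i and a
    1 at thread j, so the prediction is w0 + w[i] + w[j] + <v[i], v[j]>.
    """

    def __init__(self, num_threads, k=4, lr=0.05, reg=1e-3, seed=0):
        rng = random.Random(seed)
        self.w0 = 0.0                                   # global bias
        self.w = [0.0] * num_threads                    # per-thread linear weights
        self.v = [[rng.gauss(0.0, 0.1) for _ in range(k)]
                  for _ in range(num_threads)]          # per-thread latent factors
        self.lr, self.reg = lr, reg

    def predict(self, i, j):
        dot = sum(a * b for a, b in zip(self.v[i], self.v[j]))
        return self.w0 + self.w[i] + self.w[j] + dot

    def fit(self, observed, epochs=200):
        """Plain SGD on squared error (with L2 on w and v) over observed entries."""
        items = list(observed.items())
        rng = random.Random(1)
        lr, reg = self.lr, self.reg
        for _ in range(epochs):
            rng.shuffle(items)
            for (i, j), y in items:
                err = self.predict(i, j) - y
                self.w0 -= lr * err
                self.w[i] -= lr * (err + reg * self.w[i])
                self.w[j] -= lr * (err + reg * self.w[j])
                vi, vj = self.v[i][:], self.v[j][:]     # use pre-update factors
                # d<v_i, v_j>/dv_i = v_j and d<v_i, v_j>/dv_j = v_i
                self.v[i] = [a - lr * (err * b + reg * a) for a, b in zip(vi, vj)]
                self.v[j] = [b - lr * (err * a + reg * b) for a, b in zip(vi, vj)]
        return self

    def fill(self, num_threads, observed):
        """Dense TIM: measured (i < j) entries kept as-is, missing pairs estimated."""
        return {(i, j): observed.get((i, j), self.predict(i, j))
                for i in range(num_threads) for j in range(i + 1, num_threads)}
```

In use, `fit` is trained on the profiled entries only, and `fill` returns a dense TIM in which measured pairs keep their values and unprofiled pairs receive FM estimates. The per-thread weights capture how "interfering" each thread is on its own, while the factor dimension `k` controls how much pair-specific structure the model can express beyond those biases.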
We evaluate two SLITS variants against four state-of-the-art scheduler designs. Across 11 benchmarks, SLITS achieves an average speedup of 1.08X over the de facto standard thread scheduler, the Completely Fair Scheduler (CFS), on a 16-core configuration with 32, 64, and 128 threads. Our analysis shows that the benefits of SLITS stem from significant improvements in cache utilization. Our experimental results also confirm that SLITS is scalable and that its benefits remain robust as the number of threads increases. We also perform extensive studies to (1) break SLITS down into its components to justify the synergy of our design choices, (2) examine the impact of varying the estimation coverage of FM, (3) justify the need for Lazy Reschedule rather than periodic rescheduling, and (4) show that the hardware overhead of SLITS implementations is marginal (<1% chip area and power).
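The abstract contrasts Lazy Reschedule with periodic rescheduling but does not describe its trigger. The following is a hypothetical sketch only: `greedy_pairing`, `LazyRescheduler`, and the relative-improvement `threshold` are illustrative assumptions, not SLITS's actual policy. It co-locates threads in pairs using a (filled) TIM where lower scores mean less interference, and it changes the placement only when the candidate's estimated cost beats the current one by more than a margin.

```python
from itertools import combinations


def pair_cost(tim, a, b):
    """Estimated interaction cost of co-locating threads a and b."""
    return tim[(min(a, b), max(a, b))]


def placement_cost(tim, pairs):
    return sum(pair_cost(tim, a, b) for a, b in pairs)


def greedy_pairing(tim, threads):
    """Hypothetical contention-aware policy: take pairs in ascending cost order,
    keeping a pair only if both threads are still unplaced. With an even thread
    count and a complete TIM this yields a perfect pairing; with an odd count
    one thread is left unpaired."""
    remaining = set(threads)
    pairs = []
    candidates = sorted(combinations(sorted(threads), 2),
                        key=lambda p: pair_cost(tim, *p))
    for a, b in candidates:
        if a in remaining and b in remaining:
            pairs.append((a, b))
            remaining -= {a, b}
    return pairs


class LazyRescheduler:
    """Adopt a new placement only if it cuts estimated cost by > threshold."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.pairs = None
        self.migrations = 0

    def on_update(self, tim, threads):
        """Called when the TIM changes; returns True if the placement changed."""
        candidate = greedy_pairing(tim, threads)
        if self.pairs is None:                 # first placement, no migration yet
            self.pairs = candidate
            return True
        current = placement_cost(tim, self.pairs)
        gain = current - placement_cost(tim, candidate)
        if gain > self.threshold * current:
            self.pairs = candidate
            self.migrations += 1
            return True
        return False                           # small gains are ignored
```

A periodic scheduler would adopt the candidate placement at every tick regardless of the gain; the margin-based trigger here is only one plausible way to be "lazy", and the abstract's Lazy Reschedule study addresses why some such mechanism is preferable to periodic rescheduling.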

Cited By

  • (2023) SLITS: Sparsity-Lightened Intelligent Thread Scheduling. Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 21-22. DOI: 10.1145/3578338.3593568. Online publication date: 19-Jun-2023.

    Published In

    Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 7, Issue 1
    POMACS
    March 2023
    749 pages
    EISSN:2476-1249
    DOI:10.1145/3586099
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published in POMACS Volume 7, Issue 1

    Author Tags

    1. computer systems
    2. resource management
    3. scheduling
