DOI: 10.1145/3037697.3037753

History-Based Arbitration for Fairness in Processor-Interconnect of NUMA Servers

Published: 04 April 2017

Abstract

NUMA (non-uniform memory access) servers are commonly used in high-performance computing and datacenters. Within each server, a processor-interconnect (e.g., Intel QPI, AMD HyperTransport) is used to communicate between the different sockets or nodes. In this work, we explore the impact of the processor-interconnect on overall performance, in particular the performance unfairness caused by processor-interconnect arbitration. It is well known that locally-fair arbitration does not guarantee globally-fair bandwidth sharing, as closer nodes receive more bandwidth in a multi-hop network. However, this work demonstrates that the opposite can occur in a commodity NUMA server, where remote nodes receive higher bandwidth (and perform better). We analyze this problem and identify that this occurs because of external concentration used in router micro-architectures for processor-interconnects without globally-aware arbitration. While accessing remote memory can occur in any NUMA system, performance unfairness (or performance variation) is more critical in cloud computing and virtual machines with shared resources. We demonstrate how this unfairness creates significant performance variation when a workload is executed on the Xen virtualization platform. We then provide analysis using synthetic workloads to better understand the source of unfairness and eliminate the impact of other shared resources, including the shared last-level cache and main memory. To provide fairness, we propose a novel, history-based arbitration that tracks the history of arbitration grants made in the previous history window. A weighted arbitration is done based on the history to provide global fairness. Through simulations, we show our proposed history-based arbitration can provide global fairness and minimize the processor-interconnect performance unfairness at low cost.
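The effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's mechanism, whose details the abstract does not give. It assumes a single output link, one grant per cycle, always-backlogged sources, four local cores behind one concentrated input port, and one remote core on its own port. The function name `simulate`, the source labels, and the "fewest grants in the window" rule are all hypothetical choices made for illustration; the rule is a simplified stand-in for a weighted, history-based arbiter.

```python
from collections import Counter, deque
from itertools import cycle


def simulate(policy, cycles=10_000, window=64):
    """Return each source's share of grants on one contended output link."""
    # Assumed topology: four local cores share one concentrated input port,
    # while a single remote core arrives on its own input port.
    ports = {"local": ["L0", "L1", "L2", "L3"], "remote": ["R0"]}
    # Within a port, the sources take turns at the head of the queue.
    queues = {p: cycle(srcs) for p, srcs in ports.items()}
    head = {p: next(q) for p, q in queues.items()}
    port_names = list(ports)

    history = deque(maxlen=window)  # sources granted in the last `window` cycles
    in_window = Counter()           # grants per source inside that window
    grants = Counter()              # total grants per source
    rr = 0                          # round-robin pointer over input ports

    for _ in range(cycles):
        if policy == "round_robin":
            # Locally fair: input ports simply alternate.
            port = port_names[rr]
            rr = (rr + 1) % len(port_names)
        else:
            # Assumed history rule: grant the port whose head-of-line source
            # received the fewest grants in the window (ties go to "local").
            port = min(port_names, key=lambda p: in_window[head[p]])

        src = head[port]
        grants[src] += 1
        # Slide the window: forget the oldest grant once the window is full.
        if len(history) == history.maxlen:
            in_window[history[0]] -= 1
        history.append(src)
        in_window[src] += 1
        head[port] = next(queues[port])

    return {s: grants[s] / cycles for s in sorted(grants)}


if __name__ == "__main__":
    for policy in ("round_robin", "history"):
        print(policy, simulate(policy))
```

Under these assumptions, per-port round-robin hands the remote core half of the link while each local core receives one eighth, which mirrors the remote-favoring unfairness the abstract reports. The history-based rule in this toy model converges to an equal one-fifth share for every source.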

    Published In

    ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2017
    856 pages
    ISBN: 9781450344654
    DOI: 10.1145/3037697

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. arbitration
    2. numa servers
    3. processor-interconnect
    4. router concentration

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea

    Conference

    ASPLOS '17

    Acceptance Rates

    ASPLOS '17 Paper Acceptance Rate 53 of 320 submissions, 17%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 34
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 05 Jan 2025

    Citations

    Cited By

    • (2021) Rethinking remote memory placement on large-memory systems with path diversity. Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 63-69. DOI: 10.1145/3476886.3477516. Online publication date: 24-Aug-2021
    • (2019) Unfair Scheduling Patterns in NUMA Architectures. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 205-218. DOI: 10.1109/PACT.2019.00024. Online publication date: 23-Sep-2019
    • (2019) Enforcing Last-level Cache Partitioning through Memory Virtual Channels. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 97-109. DOI: 10.1109/PACT.2019.00016. Online publication date: 23-Sep-2019
    • (2019) A Holistic Model for Performance Prediction and Optimization on NUMA-based Virtualized Systems. IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 352-360. DOI: 10.1109/INFOCOM.2019.8737447. Online publication date: 29-Apr-2019
    • (2019) A barrier optimization framework for NUMA multi-core system. Concurrency and Computation: Practice and Experience, 32:5. DOI: 10.1002/cpe.5527. Online publication date: 21-Oct-2019
