DOI: 10.1145/1815961.1815976
research-article

Aérgia: exploiting packet latency slack in on-chip networks

Published: 19 June 2010

ABSTRACT

Traditional Networks-on-Chip (NoCs) employ simple arbitration strategies, such as round-robin or oldest-first, to decide which packets should be prioritized in the network. This is counter-intuitive since different packets can have very different effects on system performance due to, e.g., different levels of memory-level parallelism (MLP) of applications. Certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance as their latencies are hidden by other outstanding packets' latencies. In this paper, we define slack as a key measure that characterizes the relative importance of a packet. Specifically, the slack of a packet is the number of cycles the packet can be delayed in the network with no effect on execution time. This paper proposes new router prioritization policies that exploit the available slack of interfering packets in order to accelerate performance-critical packets and thus improve overall system performance. When two packets interfere with each other in a router, the packet with the lower slack value is prioritized. We describe mechanisms to estimate slack, prevent starvation, and combine slack-based prioritization with other recently proposed application-aware prioritization mechanisms.
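
The arbitration rule stated in the abstract (the interfering packet with the lower slack wins) can be illustrated with a minimal Python sketch. This is not the paper's router design: the `Packet` fields, the `arbitrate` function, and the oldest-first tie-break are assumptions made for illustration, and the slack values are taken as given rather than estimated. Starvation prevention is left out because the abstract does not describe its mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; field names are illustrative assumptions."""
    packet_id: int
    slack: int        # cycles the packet can be delayed with no effect on execution time
    injected_at: int  # cycle at which the packet entered the network


def arbitrate(contenders):
    """Pick the winner among packets contending for the same router output.

    The packet with the lowest slack is prioritized, as in the abstract.
    Ties are broken oldest-first, which is an assumption of this sketch.
    """
    if not contenders:
        raise ValueError("no packets to arbitrate")
    return min(contenders, key=lambda p: (p.slack, p.injected_at))


# A stalling (zero-slack) packet beats an older packet whose latency is hidden.
critical = Packet(packet_id=1, slack=0, injected_at=120)
tolerant = Packet(packet_id=2, slack=14, injected_at=100)
winner = arbitrate([tolerant, critical])
print(winner.packet_id)
```

An oldest-first arbiter would have chosen the `tolerant` packet here because it was injected earlier; the slack-based rule instead lets the packet that would stall its processor go first.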

We evaluate slack-based prioritization policies on a 64-core CMP with an 8x8 mesh NoC using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 21.0% over the commonly used round-robin policy. Averaged over 56 randomly-generated multiprogrammed workload mixes, the proposed policy improves system throughput by 10.3%, while also reducing application-level unfairness by 30.8%.

      Published in

        ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
        June 2010
        520 pages
        ISBN: 9781450300537
        DOI: 10.1145/1815961

        ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
        ISCA '10
        June 2010
        508 pages
        ISSN: 0163-5964
        DOI: 10.1145/1816038

        Copyright © 2010 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 543 of 3,203 submissions, 17%
