ABSTRACT
Traditional Network-on-Chips (NoCs) employ simple arbitration strategies, such as round-robin or oldest-first, to decide which packets should be prioritized in the network. This is counter-intuitive since different packets can have very different effects on system performance due to, e.g., different level of memory-level parallelism (MLP) of applications. Certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance as their latencies are hidden by other outstanding packets'latencies. In this paper, we define slack as a key measure that characterizes the relative importance of a packet. Specifically, the slack of a packet is the number of cycles the packet can be delayed in the network with no effect on execution time. This paper proposes new router prioritization policies that exploit the available slack of interfering packets in order to accelerate performance-critical packets and thus improve overall system performance. When two packets interfere with each other in a router, the packet with the lower slack value is prioritized. We describe mechanisms to estimate slack, prevent starvation, and combine slack-based prioritization with other recently proposed application-aware prioritization mechanisms.
We evaluate slack-based prioritization policies on a 64-core CMP with an 8x8 mesh NoC using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 21.0% over the commonlyused round-robin policy. Averaged over 56 randomly-generated multiprogrammed workload mixes, the proposed policy improves system throughput by 10.3%, while also reducing application-level unfairness by 30.8%.
- N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. king Su. Myrinet - A Gigabit-per-Second Local-Area Network. IEEE Micro, 1995. Google ScholarDigital Library
- E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for network on chip. Journal of Systems Arch., 2004. Google ScholarDigital Library
- E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny. The Power of Priority: NoC Based Distributed Cache Coherency. In NOCS'07, 2007. Google ScholarDigital Library
- D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA-11, 2005. Google ScholarDigital Library
- J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS-21, 2007. Google ScholarDigital Library
- A. A. Chien and J. H. Kim. Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks. ISCA-23, 1996.Google Scholar
- W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003. Google ScholarDigital Library
- R. Das, O. Mutlu, T. Moscibroda, and C. Das. Application-Aware Prioritization Mechanisms for On-Chip Networks. In MICRO-42, 2009. Google ScholarDigital Library
- A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. In SIGCOMM, 1989. Google ScholarDigital Library
- J. Dundas and T. Mudge. Improving data cache performance by pre-executing instructions under a cache miss. In ICS-11, 1997. Google ScholarDigital Library
- E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems. In ASPLOS-XV, 2010. Google ScholarDigital Library
- S. Eyerman and L. Eeckhout. System-level performance metrics for multiprogram workloads. IEEE Micro, May-June 2008. Google ScholarDigital Library
- B. Fields, R. Bodík, and M. Hill. Slack: Maximizing performance under technological constraints. In ISCA-29, 2002. Google ScholarDigital Library
- B. Fields, S. Rubin, and R. Bodík. Focusing processor policies via critical-path prediction. In ISCA-28, 2001. Google ScholarDigital Library
- D. Garcia and W. Watson. Servernet II. Parallel Computing, Routing, and Communication Workshop, June 1997.Google Scholar
- A. Glew. MLP Yes! ILP No! Memory Level Parallelism, or, Why I No Longer Worry About IPC. In ASPLOS Wild and Crazy Ideas Session, 1998.Google Scholar
- B. Grot, S. W. Keckler, and O. Mutlu. Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip. In MICRO-42, 2009. Google ScholarDigital Library
- L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, utilitarian, and capitalist cache policies on cmps: caches as a shared resource. In PACT-15, 2006. Google ScholarDigital Library
- Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA-16, 2010.Google Scholar
- D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA-8, 1981. Google ScholarDigital Library
- J. W. Lee, M. C. Ng, and K. Asanovic. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. In ISCA-35, 2008. Google ScholarDigital Library
- O. Mutlu, H. Kim, and Y. N. Patt. Efficient runahead execution: Power-efficient memory latency tolerance. IEEE Micro, 2006. Google ScholarDigital Library
- O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO-40, 2007. Google ScholarDigital Library
- O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA-35, 2008. Google ScholarDigital Library
- O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead execution: an alternative to very large instruction windows for out-of-order processors. In HPCA-9, 2003. Google ScholarDigital Library
- K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair queuing memory systems. In MICRO-39, 2006. Google ScholarDigital Library
- V. G. Oklobdzija and R. K. Krishnamurthy. Energy-Delay Characteristics of CMOS Adders, High-Performance Energy-Efficient Microprocessor Design, chapter 6. Springer US, 2006.Google Scholar
- H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. In MICRO-37, 2004. Google ScholarDigital Library
- M. Qureshi, D. Lynch, O. Mutlu, and Y. Patt. A Case for MLP-Aware Cache Replacement. In ISCA-33, 2006. Google ScholarDigital Library
- M. Qureshi and Y. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO-39, 2006. Google ScholarDigital Library
- E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. DATE, 2003. Google ScholarDigital Library
- S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In MICRO-31, 1998. Google ScholarDigital Library
- S. Subramaniam, A. Bracy, H. Wang, and G. Loh. Criticality-based optimizations for efficient load processing. In HPCA-15, 2009.Google ScholarCross Ref
- T. J. Teorey and T. B. Pinkerton. A comparative analysis of disk scheduling policies. Communications of the ACM, 1972. Google ScholarDigital Library
- R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 1967. Google ScholarDigital Library
- T. Y. Yeh and Y. N. Patt. Two-level adaptive training branch prediction. In MICRO-24, 1991. Google ScholarDigital Library
- K. H. Yum, E. J. Kim, and C. Das. QoS provisioning in clusters: an investigation of router and NIC design. In ISCA-28, 2001. Google ScholarDigital Library
- L. Zhang. Virtual clock: a new traffic control algorithm for packet switching networks. SIGCOMM, 1990. Google ScholarDigital Library
Index Terms
- Aérgia: exploiting packet latency slack in on-chip networks
Recommendations
Aérgia: exploiting packet latency slack in on-chip networks
ISCA '10Traditional Network-on-Chips (NoCs) employ simple arbitration strategies, such as round-robin or oldest-first, to decide which packets should be prioritized in the network. This is counter-intuitive since different packets can have very different ...
Application-aware prioritization mechanisms for on-chip networks
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on MicroarchitectureNetwork-on-Chips (NoCs) are likely to become a critical shared resource in future many-core processors. The challenge is to develop policies and mechanisms that enable multiple applications to efficiently and fairly share the network, to improve system ...
Aérgia: A Network-on-Chip Exploiting Packet Latency Slack
A traditional Network-on-Chip (NoC) employs simple arbitration strategies, such as round robin or oldest first, which treat packets equally regardless of the source applications' characteristics. This is suboptimal because packets can have different ...
Comments