Abstract
With the recent design shift toward increasing the number of processing elements on a chip, support for low power, low latency, and high bandwidth in the on-chip interconnect is essential. Much of the previous work has focused on router architectures and network topologies using wide or long channels. However, such solutions may result in a complicated router design and a high interconnect power and area cost. In this chapter, we present a method that exploits a table-based data compression technique relying on value patterns in cache traffic. Compressing a large packet into a small one saves power by reducing the operations required in network components, and it decreases contention by increasing the effective bandwidth of shared resources. The main challenges are providing a scalable implementation of the tables and minimizing the latency overhead of compression. We propose a shared table scheme that needs one encoding table and one decoding table for each processing element, together with a management protocol that does not require in-order delivery. This scheme eliminates the dependence of table size on network size, which makes it scalable and reduces the overhead cost of the compression tables. Simulation results are presented for 8-core and 16-core designs. Overall, our compression method reduces packet latency by up to 44% (36% on average) and reduces network power consumption by 36% on average in the 16-core tiled design.
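The table-based idea summarized above can be illustrated with a toy sketch. It assumes a fixed table of frequent values already agreed on by sender and receiver, and it does not model the chapter's shared table management protocol. The names `build_tables`, `compress`, `decompress`, and `compressed_bits`, and the one-flag-bit encoding, are hypothetical choices made only for illustration.

```python
# Toy sketch only: a fixed value table known to both sender and receiver.
# This is NOT the chapter's shared-table management protocol.

def build_tables(frequent_values):
    """Return (encode, decode) tables for a list of distinct frequent values."""
    decode = list(frequent_values)
    encode = {value: index for index, value in enumerate(decode)}
    return encode, decode

def compress(words, encode):
    """Replace each word found in the table by ('idx', i); keep others as ('raw', w)."""
    return [("idx", encode[w]) if w in encode else ("raw", w) for w in words]

def decompress(symbols, decode):
    """Invert compress() using the receiver-side decoding table."""
    return [decode[payload] if kind == "idx" else payload for kind, payload in symbols]

def compressed_bits(symbols, word_bits, index_bits):
    """Size in bits, assuming one flag bit per symbol marks index versus raw word."""
    return sum(1 + (index_bits if kind == "idx" else word_bits) for kind, _ in symbols)

encode, decode = build_tables([0x00000000, 0xFFFFFFFF])
packet = [0x00000000, 0x00000000, 0xFFFFFFFF, 0x00001234, 0x00000000]
symbols = compress(packet, encode)
restored = decompress(symbols, decode)
size = compressed_bits(symbols, word_bits=32, index_bits=1)
```

In this example four of the five 32-bit words hit the table, so the payload shrinks from 160 bits to 41 bits under the assumed encoding. Real savings depend on the value patterns in the traffic.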
Notes
1. A flow is a source-destination pair. Therefore, an n-node network has n² flows.
2. Most routers have eight ports. Additionally, 12 routers have nine ports to connect a core (eight routers at the periphery) or a memory controller (four routers in the center).
3. A flit can be subdivided into multiple physical transfer units (phits). Each phit is transferred across a channel in a single cycle. We assume that the phit size is equal to the flit size in this chapter.
4. The VC allocator (R → p) has the longest latency of the router stages and determines the delay of the router pipeline in both designs. Therefore, the number of VCs is selected to fit one clock cycle time.
5. In SNUCA-CMP, we place the value table at the router rather than at each node.
6. The destination sharing degree (dest) is the average number of destinations for each value. The source sharing degree (src) is the average number of sources for each value.
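The sharing degrees defined in the last note can be computed from a traffic trace as in the minimal sketch below. It assumes the trace is available as (source, destination, value) tuples. The function name `sharing_degrees` and the trace format are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict

def sharing_degrees(transfers):
    """Return (dest, src) sharing degrees for (source, destination, value) tuples.

    dest: average number of distinct destinations per value.
    src:  average number of distinct sources per value.
    """
    destinations = defaultdict(set)
    sources = defaultdict(set)
    for source, destination, value in transfers:
        destinations[value].add(destination)
        sources[value].add(source)
    if not destinations:
        return 0.0, 0.0
    count = len(destinations)  # number of distinct values observed
    dest = sum(len(nodes) for nodes in destinations.values()) / count
    src = sum(len(nodes) for nodes in sources.values()) / count
    return dest, src

# Hypothetical trace: value 0x0 travels 0->1, 0->2, 3->1; value 0xFF travels 1->2.
trace = [(0, 1, 0x0), (0, 2, 0x0), (3, 1, 0x0), (1, 2, 0xFF)]
dest_degree, src_degree = sharing_degrees(trace)
```

Here value 0x0 has two destinations and two sources, and value 0xFF has one of each, so both degrees are 1.5.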
Acknowledgements
This work was supported in part by NSF grants CCF-0541360 and CCF-0541384. Yuho Jin is currently supported by the National Science Foundation under Grant no. 0937060 to the Computing Research Association for the CIFellows Project.
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Jin, Y., Yum, K.H., Kim, E.J. (2011). Adaptive Data Compression for Low-Power On-Chip Networks. In: Silvano, C., Lajolo, M., Palermo, G. (eds) Low Power Networks-on-Chip. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6911-8_6
DOI: https://doi.org/10.1007/978-1-4419-6911-8_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-6910-1
Online ISBN: 978-1-4419-6911-8
eBook Packages: Engineering, Engineering (R0)