research-article

Base-delta-immediate compression: practical data compression for on-chip caches

Authors: Gennady Pekhimenko, Vivek Seshadri, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Pages 377 - 388

https://doi.org/10.1145/2370816.2370870

Published: 19 September 2012

Abstract

Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

In this paper, we introduce a new compression algorithm called Base-Delta-Immediate (BΔI) compression, a practical technique for compressing data in on-chip caches. The key idea is that, for many cache lines, the values within the cache line have a low dynamic range, i.e., the differences between values stored within the cache line are small. As a result, a cache line can be represented using a base value and an array of differences whose combined size is much smaller than the original cache line (we call this the base+delta encoding). Moreover, many cache lines intersperse such base+delta values with small values; our BΔI technique efficiently incorporates such immediate values into its encoding.
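
To make the base+delta idea concrete, the following is a minimal Python sketch, not the paper's hardware design. It assumes a 64-byte line split into 8-byte little-endian signed values with 1-byte signed deltas, and it treats small values as immediates, i.e., deltas from an implicit zero base. The function names (`compress`, `decompress`, `compressed_size`, `pack_values`) and the one-bit-per-value mask are illustrative assumptions, not details taken from the abstract.

```python
LINE_BYTES = 64  # assumed cache-line size


def fits(delta, delta_bytes):
    """Return True if delta fits in a signed integer of delta_bytes bytes."""
    limit = 1 << (8 * delta_bytes - 1)
    return -limit <= delta < limit


def pack_values(values, value_bytes=8):
    """Hypothetical helper: build a raw line from a list of signed integers."""
    return b"".join(v.to_bytes(value_bytes, "little", signed=True) for v in values)


def compress(line, value_bytes=8, delta_bytes=1):
    """Encode a line as (base, mask, deltas), or return None if it does not fit.

    mask[i] is True when deltas[i] is relative to the base and False when
    deltas[i] is an immediate (relative to an implicit zero base).
    """
    assert len(line) == LINE_BYTES
    values = [
        int.from_bytes(line[i:i + value_bytes], "little", signed=True)
        for i in range(0, LINE_BYTES, value_bytes)
    ]
    base = None
    mask, deltas = [], []
    for v in values:
        if fits(v, delta_bytes):
            # Small value: store it directly as an immediate.
            mask.append(False)
            deltas.append(v)
            continue
        if base is None:
            base = v  # first non-immediate value becomes the base
        d = v - base
        if not fits(d, delta_bytes):
            return None  # dynamic range too large for this encoding
        mask.append(True)
        deltas.append(d)
    return (base if base is not None else 0, mask, deltas)


def decompress(encoded, value_bytes=8):
    """Rebuild the original line from (base, mask, deltas)."""
    base, mask, deltas = encoded
    out = bytearray()
    for uses_base, d in zip(mask, deltas):
        v = base + d if uses_base else d
        out += v.to_bytes(value_bytes, "little", signed=True)
    return bytes(out)


def compressed_size(encoded, value_bytes=8, delta_bytes=1):
    """Bytes needed: one base, all deltas, and one mask bit per value."""
    _, mask, deltas = encoded
    return value_bytes + len(deltas) * delta_bytes + (len(mask) + 7) // 8


# Pointer-like values interleaved with small integers (made-up data).
example = pack_values([0x7FFF00001000, 3, 0x7FFF00001008, 0,
                       0x7FFF00001010, 7, 0x7FFF00001018, 1])
encoded = compress(example)
print(compressed_size(encoded), "bytes instead of", LINE_BYTES)
```

In this made-up example the encoding needs 17 bytes (an 8-byte base, eight 1-byte deltas, and 1 mask byte) instead of 64, and `decompress(encoded)` returns the original line. A line whose values differ by more than a 1-byte delta can hold makes `compress` return `None`, meaning this particular base and delta size does not apply.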

Compared to prior cache compression approaches, our studies show that BΔI strikes a sweet spot in the tradeoff between compression ratio, decompression/compression latencies, and hardware complexity. Our results show that BΔI compression improves performance for both single-core (8.1% improvement) and multi-core workloads (9.5% / 11.2% improvement for two/four cores). For many applications, BΔI provides the performance benefit of doubling the cache size of the baseline system, effectively increasing average cache capacity by 1.53X.

Index Terms

Hardware → Integrated circuits → Semiconductor memory → Dynamic memory
Information systems → Data management systems → Data structures → Data layout → Data compression

Published In

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

September 2012

512 pages

ISBN:9781450311823

DOI:10.1145/2370816

General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)
Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '12: International Conference on Parallel Architectures and Compilation Techniques

September 19 - 23, 2012

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

Total Citations: 324
Total Downloads: 1,339

Downloads (Last 12 months): 143
Downloads (Last 6 weeks): 16

Reflects downloads up to 01 Mar 2025

Citations

Cited By

Upadhyay S, Kapoor H (2025). PAF-Enc: Position Affine Encoding to Reduce Bit-Flips in Non-Volatile Main Memories. 2025 38th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), pp. 266-271. Online publication date: 4-Jan-2025.
https://doi.org/10.1109/VLSID64188.2025.00059

S A, Verma H, Kapoor H (2025). Optimizing Bandwidth Utilization Through Word Based Compression in Main Memories. 2025 38th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), pp. 91-96. Online publication date: 4-Jan-2025.
https://doi.org/10.1109/VLSID64188.2025.00029

Kim S, Byeon G, Lee S, Bae Y, Hong S (2025). Accelerating Deep Neural Networks with a Low-Cost Lossless Compression. 2025 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1-4. Online publication date: 19-Jan-2025.
https://doi.org/10.1109/ICEIC64972.2025.10879738

Park S, Choi B, Kim J (2025). C4ECC: Data Compression for Bandwidth Efficiency Under ECC Protection in GPUs. 2025 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1-4. Online publication date: 19-Jan-2025.
https://doi.org/10.1109/ICEIC64972.2025.10879667

Lazarev N, Gohil V, Tsai J, Anderson A, Chitlur B, Zhang Z, Delimitrou C, Gavrilovska A, Terry D (2024). Sabre. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pp. 1-18. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.5555/3691938.3691939

Surchenko A, Nedbailo Y (2024). Hardware Compression Method for On-Chip and Interprocessor Networks with Wide Channels and Wormhole Flow Control Policy. Informatics and Automation, 23:3, pp. 859-885. Online publication date: 28-May-2024.
https://doi.org/10.15622/ia.23.3.8

Xu Q, Yang J, Zhang F, Chen Z, Guan J, Chen K, Fan J, Shen Y, Yang K, Zhang Y, Du X (2024). Improving Graph Compression for Efficient Resource-Constrained Graph Analytics. Proceedings of the VLDB Endowment, 17:9, pp. 2212-2226. Online publication date: 1-May-2024.
https://dl.acm.org/doi/10.14778/3665844.3665852

Shao Q, Arelakis A, Stenström P (2024). HMComp: Extending Near-Memory Capacity using Compression in Hybrid Memory. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 74-84. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3650200.3656612

Sha E, Liu A, Ibrahim K, Mahmoud M, Giannoula C, Abdelhadi A, Moshovos A, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Marple: Scalable Spike Sorting for Untethered Brain-Machine Interfacing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 666-682. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640357

Cheshmikhani E, Shokouhinia F, Farbeh H (2024). A Low-Cost Fault-Tolerant Racetrack Cache Based on Data Compression. IEEE Transactions on Circuits and Systems II: Express Briefs, 71:8, pp. 3940-3944. Online publication date: Aug-2024.
https://doi.org/10.1109/TCSII.2024.3375640
