DOI: 10.1145/2370816.2370864
research-article

Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads

Published: 19 September 2012

Abstract

State-of-the-art graphics processing units (GPUs) provide very high memory bandwidth, but the performance of many general-purpose GPU (GPGPU) workloads is still limited by memory bandwidth. Although commercial GPUs have adopted compression techniques, they use them only for texture and color data, not for the data of GPGPU workloads. Furthermore, the microarchitectural details of GPU compression are proprietary, and its performance benefits have not previously been published. In this paper, we first investigate the microarchitectural changes required to support lossless compression of data transferred between the GPU and its off-chip memory, which provides higher effective bandwidth. Second, by exploiting some characteristics of floating-point numbers in many GPGPU workloads, we propose to apply lossless compression to floating-point numbers after truncating their least-significant bits (i.e., lossy compression). This can reduce bandwidth usage even further with very little impact on overall computational accuracy. Finally, we demonstrate that a GPU with our lossless and lossy compression techniques can improve the performance of memory-bound GPGPU workloads by 26% and 41% on average, respectively.
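The lossy idea in the abstract (truncate the least-significant bits of floating-point values, then compress losslessly) can be illustrated with a small software sketch. This is not the paper's hardware design: `zlib` stands in for whatever lossless compressor the memory link would use, the function names are hypothetical, and the choice of how many mantissa bits to drop is an assumption made only for illustration.

```python
import math
import struct
import zlib


def truncate_float32(value: float, drop_bits: int) -> float:
    """Round `value` to float32, then zero its `drop_bits` lowest mantissa bits."""
    if not 0 <= drop_bits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF << drop_bits) & 0xFFFFFFFF
    (truncated,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return truncated


def pack_block(values, drop_bits: int = 0) -> bytes:
    """Serialize values as little-endian float32 words after truncation."""
    words = [truncate_float32(v, drop_bits) for v in values]
    return struct.pack(f"<{len(words)}f", *words)


def compressed_size(values, drop_bits: int = 0) -> int:
    """Bytes left after lossless compression (zlib is only a stand-in)."""
    return len(zlib.compress(pack_block(values, drop_bits)))


def max_relative_error(values, drop_bits: int) -> float:
    """Largest relative error versus the untruncated float32 values.

    Assumes every input is nonzero.
    """
    worst = 0.0
    for v in values:
        exact = truncate_float32(v, 0)
        approx = truncate_float32(v, drop_bits)
        worst = max(worst, abs(approx - exact) / abs(exact))
    return worst


if __name__ == "__main__":
    # Hypothetical sample data: a smooth signal with values in [1, 3].
    data = [math.sin(i * 0.01) + 2.0 for i in range(4096)]
    for drop in (0, 8, 16):
        print(drop, compressed_size(data, drop), max_relative_error(data, drop))
```

With `drop_bits = 0` the sketch corresponds to the purely lossless case. For normal float32 values, zeroing `drop_bits` mantissa bits keeps the relative error below 2^(drop_bits - 23), which is one way to reason about the accuracy impact of a given truncation width.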



    Published In

    PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
    September 2012
    512 pages
    ISBN:9781450311823
    DOI:10.1145/2370816
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. graphics processing units
    2. lossless and lossy data compression

    Qualifiers

    • Research-article

    Conference

    PACT '12
    Sponsor:
    • IFIP WG 10.3
    • SIGARCH
    • IEEE CS TCPP
    • IEEE CS TCAA

    Acceptance Rates

    Overall Acceptance Rate 121 of 471 submissions, 26%

    Cited By

    • (2024) Memory Allocation Under Hardware Compression. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 966-982. DOI: 10.1109/MICRO61859.2024.00075. Online publication date: 2-Nov-2024
    • (2024) Enterprise-Class Cache Compression Design. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 996-1011. DOI: 10.1109/HPCA57654.2024.00080. Online publication date: 2-Mar-2024
    • (2024) Using the Discrete Wavelet Transform for Lossy On-the-Fly Compression of GPU Fluid Simulations. International Journal for Numerical Methods in Fluids, 97:3, pp. 283-298. DOI: 10.1002/fld.5344. Online publication date: 27-Oct-2024
    • (2023) DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7:1, pp. 1-36. DOI: 10.1145/3579445. Online publication date: 2-Mar-2023
    • (2023) Beyond Compression Ratio: A Throughput Analysis of Memory Compression Techniques for GPUs. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 255-262. DOI: 10.1109/ICCD58817.2023.00047. Online publication date: 6-Nov-2023
    • (2022) L2C: Combining Lossy and Lossless Compression on Memory and I/O. ACM Transactions on Embedded Computing Systems, 21:1, pp. 1-27. DOI: 10.1145/3481641. Online publication date: 14-Jan-2022
    • (2022) Translation-Optimized Memory Compression for Capacity. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 992-1011. DOI: 10.1109/MICRO56248.2022.00073. Online publication date: 1-Oct-2022
    • (2021) QSLC: Quantization-Based, Low-Error Selective Approximation for GPUs. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 475-480. DOI: 10.23919/DATE51398.2021.9474124. Online publication date: 1-Feb-2021
    • (2020) MemSZ. ACM Transactions on Architecture and Code Optimization, 17:4, pp. 1-25. DOI: 10.1145/3424668. Online publication date: 10-Nov-2020
    • (2020) A GPU Register File using Static Data Compression. Proceedings of the 49th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3404397.3404431. Online publication date: 17-Aug-2020
