Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads

Authors: Michael J. Schulte, Nam Sung Kim

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Pages 325 - 334

https://doi.org/10.1145/2370816.2370864

Published: 19 September 2012

Abstract

State-of-the-art graphics processing units (GPUs) provide very high memory bandwidth, but the performance of many general-purpose GPU (GPGPU) workloads is still bounded by memory bandwidth. Although compression techniques have been adopted by commercial GPUs, they are used only for compressing texture and color data, not data for GPGPU workloads. Furthermore, the microarchitectural details of GPU compression are proprietary, and its performance benefits have not been previously published. In this paper, we first investigate the microarchitectural changes required to support lossless compression of data transferred between the GPU and its off-chip memory, which provides higher effective bandwidth. Second, by exploiting some characteristics of floating-point numbers in many GPGPU workloads, we propose to apply lossless compression to floating-point numbers after truncating their least-significant bits (i.e., lossy compression). This can reduce bandwidth usage even further with very little impact on overall computational accuracy. Finally, we demonstrate that a GPU with our lossless and lossy compression techniques can improve the performance of memory-bound GPGPU workloads by 26% and 41% on average, respectively.
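The lossy step described in the abstract (truncate least-significant bits, then compress losslessly) can be illustrated with a small software sketch. This is not the paper's hardware design: `zlib` stands in for the lossless compressor, the helper names `truncate_lsbs` and `packed_size` are hypothetical, the sine-wave input is made-up sample data, and the choice of dropping 16 mantissa bits is an assumption made only for illustration.

```python
import math
import struct
import zlib


def truncate_lsbs(value: float, drop_bits: int) -> float:
    """Zero the `drop_bits` least-significant mantissa bits of a float32."""
    if not 0 <= drop_bits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    # Reinterpret the float32 encoding of `value` as a 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF << drop_bits) & 0xFFFFFFFF
    (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return out


def packed_size(values, drop_bits: int) -> int:
    """Size in bytes after truncation followed by lossless compression.

    zlib is only a stand-in for a lossless compressor (an assumption).
    """
    raw = b"".join(struct.pack("<f", truncate_lsbs(v, drop_bits)) for v in values)
    return len(zlib.compress(raw))


# Hypothetical workload data: a smooth single-precision signal.
samples = [math.sin(i * 0.001) for i in range(4096)]

lossless_bytes = packed_size(samples, 0)   # compression only
lossy_bytes = packed_size(samples, 16)     # truncate 16 bits, then compress

# Worst-case relative error introduced by the truncation alone,
# measured against the float32-rounded originals.
reference = [truncate_lsbs(s, 0) for s in samples]
max_rel_err = max(
    abs(truncate_lsbs(r, 16) - r) / abs(r) for r in reference if r != 0.0
)

print(f"lossless: {lossless_bytes} B, lossy: {lossy_bytes} B, "
      f"max relative error: {max_rel_err:.2e}")
```

Zeroing the low mantissa bits removes the least predictable part of each value, so the remaining bit patterns repeat more often and compress better, while the relative error for normal float32 values stays below 2^(drop_bits - 23).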

Index Terms

Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Published In

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

September 2012

512 pages

ISBN: 9781450311823

DOI: 10.1145/2370816

General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)

Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '12

PACT '12: International Conference on Parallel Architectures and Compilation Techniques

September 19 - 23, 2012

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Article Metrics

Total Citations: 75
Total Downloads: 922
Downloads (Last 12 months): 46
Downloads (Last 6 weeks): 6

Reflects downloads up to 05 Mar 2025

Cited By

Laghari M, Liu Y, Panwar G, Bears D, Jearls C, Srinivas R, Choukse E, Cameron K, Butt A, Jian X (2024). Memory Allocation Under Hardware Compression. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 966-982. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00075

Buyuktosunoglu A, Trilla D, Abali B, Berger D, Walters C, Lee J (2024). Enterprise-Class Cache Compression Design. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 996-1011. Online publication date: 2-Mar-2024.
https://doi.org/10.1109/HPCA57654.2024.00080

Flint C, Huppé A, Helluy P, Bramas B, Genaud S (2024). Using the Discrete Wavelet Transform for Lossy On-the-Fly Compression of GPU Fluid Simulations. International Journal for Numerical Methods in Fluids, 97:3, pp. 283-298. Online publication date: 27-Oct-2024.
https://doi.org/10.1002/fld.5344

Giannoula C, Huang K, Tang J, Koziris N, Goumas G, Chishti Z, Vijaykumar N (2023). DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7:1, pp. 1-36. Online publication date: 2-Mar-2023.
https://doi.org/10.1145/3579445

Renz M, Lal S (2023). Beyond Compression Ratio: A Throughput Analysis of Memory Compression Techniques for GPUs. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 255-262. Online publication date: 6-Nov-2023.
https://doi.org/10.1109/ICCD58817.2023.00047

Eldstål-Ahrens A, Arelakis A, Sourdis I (2022). L2C: Combining Lossy and Lossless Compression on Memory and I/O. ACM Transactions on Embedded Computing Systems, 21:1, pp. 1-27. Online publication date: 14-Jan-2022.
https://dl.acm.org/doi/10.1145/3481641

Panwar G, Laghari M, Bears D, Liu Y, Jearls C, Choukse E, Cameron K, Butt A, Jian X, Hardavellas N, Campanoni S, Grot B, Karpuzcu U (2022). Translation-Optimized Memory Compression for Capacity. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 992-1011. Online publication date: 1-Oct-2022.
https://dl.acm.org/doi/10.1109/MICRO56248.2022.00073

Lal S, Lucas J, Juurlink B (2021). QSLC: Quantization-Based, Low-Error Selective Approximation for GPUs. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 475-480. Online publication date: 1-Feb-2021.
https://doi.org/10.23919/DATE51398.2021.9474124

Eldstål-Ahrens A, Sourdis I (2020). MemSZ. ACM Transactions on Architecture and Code Optimization, 17:4, pp. 1-25. Online publication date: 10-Nov-2020.
https://dl.acm.org/doi/10.1145/3424668

Angerd A, Sintorn E, Stenstrom P (2020). A GPU Register File using Static Data Compression. Proceedings of the 49th International Conference on Parallel Processing, pp. 1-10. Online publication date: 17-Aug-2020.
https://dl.acm.org/doi/10.1145/3404397.3404431
