Warped-compression: enabling power efficient GPUs through register compression

Authors:

Murali Annavaram

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 502 - 514

https://doi.org/10.1145/2749469.2750417

Published: 13 June 2015

Abstract

This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. The work is motivated by the observation that the register values of threads within the same warp are similar; that is, the arithmetic differences between the registers of two successive threads are small. Removing this data redundancy through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge because they must hold many concurrent execution contexts and enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost and implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing the delta values of all the other registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which reduces dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
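The base-plus-delta idea in the abstract can be illustrated with a small software sketch. This is not the paper's hardware design: the warp size, register width, bank width, candidate delta widths, and every function name below are assumptions chosen for illustration, and the sketch uses only the first thread's register as the base. It compresses one warp-wide register, restores it, and counts how many banks the stored form would occupy.

```python
from math import ceil

# All constants are illustrative assumptions, not values from the paper.
WARP_SIZE = 32           # threads per warp
REG_BYTES = 4            # one 32-bit register per thread
BANK_BYTES = 16          # bytes a single bank holds for one warp register
DELTA_SIZES = (0, 1, 2)  # candidate delta widths in bytes, tried smallest first


def _fits(delta, nbytes):
    """True if a signed delta is representable in nbytes (0 bytes means delta == 0)."""
    if nbytes == 0:
        return delta == 0
    limit = 1 << (8 * nbytes - 1)
    return -limit <= delta < limit


def compress(values):
    """Return (base, delta_bytes, deltas), or None if no delta width fits."""
    base = values[0]  # hypothetical choice: thread 0's value is the base
    deltas = [v - base for v in values[1:]]
    for nbytes in DELTA_SIZES:
        if all(_fits(d, nbytes) for d in deltas):
            return base, nbytes, deltas
    return None  # store the warp register uncompressed


def decompress(base, deltas):
    """Rebuild every thread's value from the base and the per-thread deltas."""
    return [base] + [base + d for d in deltas]


def banks_touched(compressed):
    """Number of banks needed to hold the (possibly compressed) warp register."""
    if compressed is None:
        size = WARP_SIZE * REG_BYTES
    else:
        _, nbytes, deltas = compressed
        size = REG_BYTES + nbytes * len(deltas)
    return ceil(size / BANK_BYTES)


# Usage: per-thread addresses with a 4-byte stride compress to 1-byte deltas.
addresses = [0x1000 + 4 * t for t in range(WARP_SIZE)]
packed = compress(addresses)
assert decompress(packed[0], packed[2]) == addresses
print(banks_touched(packed), "of", banks_touched(None), "banks")
```

Under these assumed sizes, the strided example occupies 3 of 8 banks, which mirrors the abstract's argument: fewer banks are activated per access (dynamic power), and the banks left unused become candidates for power gating (leakage power).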

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr, Intel
Program Chair: David Albonesi, Cornell

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '15

Sponsor:

IEEE TCCA
SIGARCH

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
