DOI: 10.1145/2749469.2750417
research-article

Warped-compression: enabling power efficient GPUs through register compression

Published: 13 June 2015

Abstract

This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. The work is motivated by the observation that the register values of threads within the same warp are similar; that is, the arithmetic differences between the registers of two successive threads are small. Removing this data redundancy through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge because they must hold many concurrent execution contexts and enable fast context switching. As a result, the register file consumes a large fraction of total GPU chip power. GPU design trends show that register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost, implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register or one of the register banks as the primary base and then computing delta values for all the other registers or banks. Warped-Compression can reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which reduces dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
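The base-plus-delta idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's hardware design: the warp size of 32, the 32-bit register width, the candidate delta widths, the 16-byte bank width, and every function name below are assumptions made for illustration only. The sketch uses the first thread's register as the base, stores the other threads' values as signed deltas if they all fit in a narrow width, and estimates how many banks a warp-level access would touch.

```python
import math

# All constants are illustrative assumptions, not figures from the paper.
WARP_SIZE = 32            # threads per warp
REG_BYTES = 4             # one 32-bit register per thread
DELTA_SIZES = (0, 1, 2)   # candidate delta widths in bytes
BANK_BYTES = 16           # hypothetical width of one register bank
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def fits(delta, nbytes):
    """True if a signed delta is representable in nbytes bytes."""
    if nbytes == 0:
        return delta == 0
    limit = 1 << (8 * nbytes - 1)
    return -limit <= delta < limit


def compress(values):
    """Return (base, delta_bytes, deltas), or None if incompressible."""
    base = values[0] & MASK32
    deltas = [to_signed32(v - base) for v in values[1:]]
    for nbytes in DELTA_SIZES:
        if all(fits(d, nbytes) for d in deltas):
            return base, nbytes, deltas
    return None


def decompress(base, deltas):
    """Rebuild the per-thread register values from base and deltas."""
    return [base] + [(base + d) & MASK32 for d in deltas]


def compressed_size(values):
    """Bytes needed to store one warp-wide register."""
    packed = compress(values)
    if packed is None:
        return len(values) * REG_BYTES
    _, nbytes, deltas = packed
    return REG_BYTES + nbytes * len(deltas)


def banks_touched(values):
    """Banks activated by one warp-level access, under BANK_BYTES."""
    return math.ceil(compressed_size(values) / BANK_BYTES)
```

For example, a warp whose threads hold the addresses `1000 + 4 * tid` compresses to a 4-byte base plus 31 one-byte deltas under these assumptions, so `banks_touched` reports fewer banks than the eight an uncompressed 128-byte register would occupy.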


Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ISCA '15
Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 78
  • Downloads (last 6 weeks): 7
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024
  • (2024) Scalable, Programmable and Dense: The HammerBlade Open-Source RISC-V Manycore. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 770-784. DOI: 10.1109/ISCA59077.2024.00061. Online publication date: 29-Jun-2024
  • (2024) GPU Architecture. Handbook of Computer Architecture, pp. 531-559. DOI: 10.1007/978-981-97-9314-3_66. Online publication date: 21-Dec-2024
  • (2023) Energy-Efficient Parallel Computing: Challenges to Scaling. Information, 14:4 (248). DOI: 10.3390/info14040248. Online publication date: 20-Apr-2023
  • (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. DOI: 10.1145/3579371.3589039. Online publication date: 17-Jun-2023
  • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 530-542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Feb-2023
  • (2023) Re-Cache: Mitigating cache contention by exploiting locality characteristics with reconfigurable memory hierarchy for GPGPUs. Microelectronics Journal, 138 (105825). DOI: 10.1016/j.mejo.2023.105825. Online publication date: Aug-2023
  • (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-2. Online publication date: 25-Jun-2023
  • (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-1. Online publication date: 16-May-2023
  • (2022) An old friend is better than two new ones: dual-screen Android. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 86-98. DOI: 10.1145/3519941.3535071. Online publication date: 14-Jun-2022
