PPMexe: Program compression

Published: 01 January 2007

Abstract

With the emergence of software delivery platforms, code compression has become an important system component that strongly affects performance. This article presents PPMexe, a compression mechanism for program binaries that analyzes their syntax and semantics to achieve superior compression ratios. We use the generic paradigm of prediction by partial matching (PPM) as the foundation of our compression codec. PPMexe combines PPM with two preprocessing steps: (i) instruction rescheduling to improve prediction rates and (ii) heuristic partitioning of a program binary into streams with high autocorrelation. We improve the traditional PPM algorithm by (iii) using an additional alphabet of frequent variable-length supersymbols extracted from the input stream of fixed-length symbols. In addition, PPMexe features (iv) a low-overhead mechanism that enables decompression starting from an arbitrary instruction of the executable, a property pivotal for runtime software delivery. We implemented PPMexe for x86 binaries and tested it on several large applications. Binaries compressed using PPMexe were 18-24% smaller than files created using off-the-shelf PPMD, one of the best available compressors.
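To make the role of stream partitioning in step (ii) more concrete, the following is a minimal sketch, not the PPMexe implementation. It estimates the ideal code length of a symbol sequence under a simplified PPM-style adaptive model (escape probabilities in the style of PPM method C, no exclusions, no actual arithmetic coder) and compares one interleaved stream against the same data split into two streams. The function name `ppm_cost_bits`, the parameter defaults, and the toy "opcode" and "operand" values are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def ppm_cost_bits(symbols, max_order=2, alphabet_size=256):
    """Estimate the ideal code length in bits of `symbols` under a
    simplified PPM-style model (hypothetical helper, illustration only)."""
    # counts[order][context] maps each symbol seen after `context` to its count.
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    total_bits = 0.0
    for i, sym in enumerate(symbols):
        prob = 1.0
        coded = False
        # Try the longest available context first, then fall back.
        for order in range(min(max_order, i), -1, -1):
            ctx = tuple(symbols[i - order:i])
            table = counts[order].get(ctx)
            if not table:
                continue  # context never seen: nothing to escape from
            total = sum(table.values())
            distinct = len(table)
            if sym in table:
                prob *= table[sym] / (total + distinct)
                coded = True
                break
            prob *= distinct / (total + distinct)  # escape to a shorter context
        if not coded:
            prob *= 1.0 / alphabet_size  # last resort: uniform over the alphabet
        total_bits += -math.log2(prob)
        # Update the statistics of every context order with the coded symbol.
        for order in range(min(max_order, i) + 1):
            counts[order][tuple(symbols[i - order:i])][sym] += 1
    return total_bits


# Toy data (assumed): two periodic sequences standing in for two fields
# of an instruction stream, stored interleaved in the "binary".
opcodes = [10, 20, 30] * 70
operands = [1, 2, 3, 4, 5, 6, 7] * 30
interleaved = [s for pair in zip(opcodes, operands) for s in pair]

cost_interleaved = ppm_cost_bits(interleaved)
cost_split = ppm_cost_bits(opcodes) + ppm_cost_bits(operands)
print(f"interleaved: {cost_interleaved:.1f} bits, split: {cost_split:.1f} bits")
```

In this toy setting the split streams cost fewer bits because each stream needs only a handful of contexts to become predictable, whereas the interleaved stream mixes both fields into every context. Real binaries are far less regular, so the sketch shows the direction of the effect only, not its magnitude.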

Published In

ACM Transactions on Programming Languages and Systems, Volume 29, Issue 1
January 2007
273 pages
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/1180475

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. instruction scheduling
  2. prediction by partial matching
  3. random access compression
  4. software compression
  5. software distribution

Qualifiers

  • Article
