DOI: 10.1145/1632149.1632157
research-article

Evaluating text preprocessing to improve compression on maillogs

Published: 12 October 2009

ABSTRACT

Maillogs contain important information about mail which has been sent or received. This information can be used for statistical purposes, to help prevent viruses or to help prevent spam. In order to satisfy regulations and follow good security practices, maillogs need to be monitored and archived. Since there is a large quantity of data, some form of data reduction is necessary. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data. Text preprocessing can be used to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can be used to improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and provides the results of experiments conducted using the ppmd, gzip, bzip2 and 7zip programs. These tests show that text preprocessing improves data compression on maillogs. Improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. The tests also show that a dictionary may be generated and used on other maillogs to yield reductions within half a percent of the results achieved for the maillog used to generate the dictionary.
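
The abstract does not reproduce the paper's dictionary-construction algorithm, so the following is only a minimal sketch of word replacement in general, not the authors' method. It assumes a simple frequency-based dictionary, a hypothetical escape byte (`\x01`) that never occurs in the raw log, and invented Postfix-style sample lines; the function names `build_dictionary`, `encode` and `decode` are illustrative.

```python
import bz2
import gzip
import re
from collections import Counter

ESCAPE = "\x01"  # assumed marker byte; must not occur in the raw log
TOKEN_RE = re.compile(r"\S+|\s+")  # words and the whitespace runs between them


def build_dictionary(text, max_entries=1000, min_length=4):
    """Pick repeated words, ranked by (occurrences * length) as a rough saving estimate."""
    counts = Counter(re.findall(r"\S+", text))
    candidates = [w for w, n in counts.items() if len(w) >= min_length and n > 1]
    candidates.sort(key=lambda w: counts[w] * len(w), reverse=True)
    return candidates[:max_entries]


def encode(text, words):
    """Replace every dictionary word with ESCAPE followed by its index in hex."""
    if ESCAPE in text:
        raise ValueError("escape byte already present in input")
    codes = {w: ESCAPE + format(i, "x") for i, w in enumerate(words)}
    return "".join(codes.get(tok, tok) for tok in TOKEN_RE.findall(text))


def decode(text, words):
    """Invert encode() using the same word list."""
    def restore(tok):
        return words[int(tok[1:], 16)] if tok.startswith(ESCAPE) else tok
    return "".join(restore(tok) for tok in TOKEN_RE.findall(text))


# Invented sample lines (hypothetical hosts, PIDs and queue IDs), repeated to mimic a log.
sample = (
    "Oct 12 10:00:01 mx1 postfix/smtpd[2101]: connect from unknown[192.0.2.10]\n"
    "Oct 12 10:00:02 mx1 postfix/smtpd[2101]: disconnect from unknown[192.0.2.10]\n"
    "Oct 12 10:00:03 mx1 postfix/qmgr[1500]: 4F2A1: removed\n"
) * 200

words = build_dictionary(sample)
packed = encode(sample, words)
assert decode(packed, words) == sample  # the preprocessing step is lossless

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    raw_size = len(compress(sample.encode()))
    pre_size = len(compress(packed.encode()))
    print(f"{name}: {raw_size} bytes without preprocessing, {pre_size} bytes with it")
```

The word list has to be kept alongside the archive, because `decode` needs it. Whether the preprocessed input actually compresses better depends on the data and the compressor, which is the question the paper evaluates experimentally.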

Published in

SAICSIT '09: Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
October 2009
225 pages
ISBN: 9781605586434
DOI: 10.1145/1632149

Copyright © 2009 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 187 of 439 submissions, 43%
