ABSTRACT
Maillogs record important information about mail that has been sent or received. This information can be used for statistical purposes and to help prevent viruses and spam. To satisfy regulations and follow good security practice, maillogs need to be monitored and archived. Because the volume of data is large, some form of data reduction is necessary, and data compression programs such as gzip and bzip2 are commonly used for this. Text preprocessing can be used to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can also improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and reports the results of experiments conducted using the ppmd, gzip, bzip2 and 7zip programs. The experiments show that text preprocessing improves data compression on maillogs: improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. They also show that a dictionary generated from one maillog can be used on other maillogs, yielding reductions within half a percent of those achieved on the maillog used to generate the dictionary.
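The general idea of word replacement before compression can be illustrated with a minimal sketch. This is not the dictionary construction algorithm presented in the paper, which the abstract does not describe. It assumes the log is pure ASCII, so that the byte values 0x80 to 0xFF are free to serve as one-byte codes. It also assumes a simple ranking of words by estimated bytes saved. The function names, the ranking rule and the sample log lines are all hypothetical.

```python
import bz2
import gzip
import re
from collections import Counter

CODE_BASE = 0x80   # assumption: input is pure ASCII, so bytes 0x80-0xFF are unused
MAX_WORDS = 128    # one code per free byte value


def build_dictionary(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Rank whitespace-delimited words by estimated saving and keep the best."""
    counts = Counter(tok for tok in re.split(r"\s+", text) if len(tok) > 1)
    # Estimated saving: occurrences * (length - 1), because each code is one byte.
    ranked = sorted(counts, key=lambda w: (-(counts[w] * (len(w) - 1)), w))
    return ranked[:max_words]


def encode(text: str, words: list[str]) -> bytes:
    """Replace every dictionary word with its one-byte code."""
    if not text.isascii():
        raise ValueError("this sketch only handles pure-ASCII input")
    codes = {w: chr(CODE_BASE + i) for i, w in enumerate(words)}
    # The capturing group keeps the whitespace runs, so the text stays reversible.
    parts = re.split(r"(\s+)", text)
    return "".join(codes.get(p, p) for p in parts).encode("latin-1")


def decode(data: bytes, words: list[str]) -> str:
    """Invert encode(): expand codes back to words, pass ASCII bytes through."""
    return "".join(words[b - CODE_BASE] if b >= CODE_BASE else chr(b) for b in data)


def compressed_sizes(raw: bytes) -> dict[str, int]:
    """Compressed size in bytes under gzip and bzip2 (Python standard library)."""
    return {"gzip": len(gzip.compress(raw)), "bzip2": len(bz2.compress(raw))}


# Invented placeholder lines, not the format of any particular mail server.
SAMPLE = "".join(
    f"Jan 01 00:00:{i % 60:02d} mailhost mta[{100 + i}]: "
    f"from=<user{i % 7}@example.org> to=<user{i % 5}@example.net> status=sent\n"
    for i in range(500)
)

if __name__ == "__main__":
    words = build_dictionary(SAMPLE)
    encoded = encode(SAMPLE, words)
    assert decode(encoded, words) == SAMPLE  # preprocessing must be lossless
    print("original    :", compressed_sizes(SAMPLE.encode("ascii")))
    print("preprocessed:", compressed_sizes(encoded))
```

Because `build_dictionary` is separate from `encode`, a dictionary built from one log can be applied to a different log, which mirrors the reuse scenario the abstract describes. Whether preprocessing helps on a given input depends on the data and the compressor, so the sketch prints sizes rather than assuming an improvement.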