research-article

Evaluating text preprocessing to improve compression on maillogs

Authors:
Fred Otten, Rhodes University, Grahamstown, South Africa
Barry Irwin, Rhodes University, Grahamstown, South Africa
Hannah Thinyane, Rhodes University, Grahamstown, South Africa

SAICSIT '09: Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, October 2009, Pages 44–53. https://doi.org/10.1145/1632149.1632157

Published: 12 October 2009

ABSTRACT

Maillogs contain important information about mail that has been sent or received. This information can be used for statistical purposes and to help prevent viruses and spam. To satisfy regulations and follow good security practice, maillogs need to be monitored and archived. Because the quantity of data is large, some form of data reduction is necessary, and data compression programs such as gzip and bzip2 are commonly used for this purpose. Text preprocessing can be used to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can also improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and reports the results of experiments conducted with the ppmd, gzip, bzip2 and 7zip programs. These tests show that text preprocessing improves data compression on maillogs: improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. They also show that a dictionary generated from one maillog can be used on other maillogs to yield reductions within half a percent of the results achieved for the maillog used to generate the dictionary.
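
The abstract does not spell out the dictionary-construction algorithm, so the following is only a minimal sketch of the general idea of word replacement, not the authors' method. It assumes that tokens are ranked by an estimated saving (length times frequency), that each chosen token is replaced by a single character from the Unicode private use area, and that the input log contains no such characters. The names `build_dictionary`, `encode_words`, `decode_words` and `compressed_sizes` are hypothetical. Python's standard library has no ppmd module, so the size comparison covers only gzip, bzip2 and LZMA (the default method family of 7zip).

```python
import bz2
import gzip
import lzma
import re
from collections import Counter

# Assumption: a "word" is a run of characters that commonly appears in maillogs.
TOKEN = re.compile(r"[A-Za-z0-9_.@=-]+")
CODE_BASE = 0xE000   # first code point of the Unicode private use area
CODE_LAST = 0xF8FF   # last code point of that area


def build_dictionary(text, max_entries=256, min_len=4):
    """Map frequent tokens to one-character codes, best estimated saving first."""
    counts = Counter(t for t in TOKEN.findall(text) if len(t) >= min_len)
    candidates = [(w, c) for w, c in counts.items() if c >= 2]
    # Estimated saving: characters removed per occurrence times occurrences.
    candidates.sort(key=lambda wc: (-(len(wc[0]) - 1) * wc[1], wc[0]))
    limit = min(max_entries, CODE_LAST - CODE_BASE + 1)
    return {w: chr(CODE_BASE + i) for i, (w, _) in enumerate(candidates[:limit])}


def encode_words(text, dictionary):
    """Replace every whole token found in the dictionary by its code."""
    if any(CODE_BASE <= ord(ch) <= CODE_LAST for ch in text):
        raise ValueError("input already contains private use characters")
    return TOKEN.sub(lambda m: dictionary.get(m.group(0), m.group(0)), text)


def decode_words(text, dictionary):
    """Invert encode_words by expanding each code back to its token."""
    inverse = {code: word for word, code in dictionary.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def compressed_sizes(text):
    """Return compressed sizes in bytes for three standard-library compressors."""
    data = text.encode("utf-8")
    return {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }
```

A dictionary built from one log can be passed unchanged to `encode_words` for a different log, which mirrors the reuse experiment described in the abstract. Whether the preprocessed text actually compresses better depends on the data and the compressor, so comparing `compressed_sizes(text)` with `compressed_sizes(encode_words(text, dictionary))` on real maillogs is the only way to tell.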

Published in
SAICSIT '09: Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
October 2009
225 pages
ISBN: 9781605586434
DOI: 10.1145/1632149
Conference Chair: Barry Dwolatzky
Program Chairs: Jason Cohen, Scott Hazelhurst
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
algorithms
data compression
digital forensics
log management
security and trust management
Acceptance Rates
Overall Acceptance Rate: 187 of 439 submissions, 43%