A study of the performance of general compressors on log files

Empirical Software Engineering

Abstract

Large-scale software systems and cloud services continue to produce large amounts of log data. Such log data is usually preserved for a long time (e.g., for auditing purposes). In practice, general compressors, like the LZ77 compressor used in gzip, are usually used to compress log data and reduce the cost of long-term storage. However, such general compressors do not consider the unique nature of log data. In this paper, we study the performance of general compressors on log data relative to their performance on natural language data. We used 12 widely used general compressors to compress nine log files, which we collected by surveying prior literature on text compression, log compression and log analysis. We observe that log data is more repetitive than natural language data, and that log data can be compressed and decompressed faster and with higher compression ratios. Moreover, the compressor with the highest compression ratio for natural language data is rarely the one with the highest compression ratio for log data. Nevertheless, the compressors with the highest compression ratio for log data are rarely adopted in practice by current logging libraries and log management tools. We also observe that the peak compression and decompression speeds of general compressors on log data are often achieved with a small data size, a size that log management tools may not use. Finally, we observe that the optimal compression performance on log data (measured by a combined compression performance score) usually requires the compression level to be configured higher than the default level. Our findings call for careful consideration when choosing general compressors and their associated compression levels for log data in practice. In addition, our findings shed light on opportunities for future research on compressors that better suit the characteristics of log data.
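
The two kinds of measurement named in the abstract, compression ratio and compression/decompression time, can be illustrated with a minimal Python sketch. This is not the study's benchmark harness: the synthetic log generator, the function names, and the choice of three standard-library compressors (zlib, bz2, lzma) are assumptions made for illustration, and the ratio is taken here as original size divided by compressed size.

```python
import bz2
import lzma
import time
import zlib


def make_synthetic_log(n_lines: int = 5000) -> bytes:
    """Build a hypothetical log: one fixed template with a few varying fields."""
    lines = [
        f"2019-07-04 12:{i // 60 % 60:02d}:{i % 60:02d} INFO worker-{i % 8} "
        f"processed request id={i} status=200"
        for i in range(n_lines)
    ]
    return "\n".join(lines).encode("utf-8")


def measure(name, compress, decompress, data: bytes) -> dict:
    """Time one compress/decompress round trip and compute the ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    # A lossless compressor must reproduce the input exactly.
    assert restored == data
    return {
        "name": name,
        "ratio": len(data) / len(packed),  # assumed definition: original / compressed
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Standard-library compressors only; the study's 12 compressors are not reproduced here.
COMPRESSORS = {
    "zlib (DEFLATE, LZ77-based)": (zlib.compress, zlib.decompress),
    "bz2 (BWT-based)": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

if __name__ == "__main__":
    log_data = make_synthetic_log()
    for label, (comp, decomp) in COMPRESSORS.items():
        r = measure(label, comp, decomp, log_data)
        print(f"{r['name']:<28} ratio={r['ratio']:.1f} "
              f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s")
```

Replacing `make_synthetic_log()` with the bytes of a real log file and of a natural language corpus would give a rough side-by-side comparison, although single-run wall-clock timings like these are noisy and should be repeated before drawing conclusions.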

Notes

  1. Our replication package is available at: https://github.com/SAILResearch/suppmaterial-20-kundi-log_compression (password: SAILResearchLogCompression). We will open access to this package on a GitHub repository when the paper is accepted.

  2. http://lucene.apache.org/core/7_7_0/core/org/apache/lucene/codecs/lucene50/Lucene50StoredFieldsFormat.html

  3. https://logback.qos.ch/manual/appenders.html

  4. https://jira.qos.ch/browse/LOGBACK-783

  5. https://dl.acm.org/

  6. https://ieeexplore.ieee.org/Xplore/home.jsp

  7. https://link.springer.com/

  8. https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html

  9. https://dumps.wikimedia.org/enwiki/

  10. The compressor implementations that we used are also included in our replication package.

  11. https://github.com/giampaolo/psutil

  12. The selected compression tools gzip, bzip2, and 7zip_ppmd implement the nine compression levels of LZ77, BWT, and PPMD, respectively.
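
As a rough illustration of what sweeping the nine compression levels looks like, the sketch below uses Python's `zlib` (DEFLATE, the algorithm behind gzip) and `bz2` modules, which both accept levels 1 to 9. It does not use the gzip, bzip2, or 7zip_ppmd tools from the study, PPMD is omitted because the Python standard library has no PPMD implementation, and the synthetic log and function names are hypothetical.

```python
import bz2
import zlib


def synthetic_log(n_lines: int = 5000) -> bytes:
    """Hypothetical log data: a repeated template with a changing counter."""
    lines = [
        f"INFO node-{i % 16} heartbeat ok seq={i} latency_ms={i % 97}"
        for i in range(n_lines)
    ]
    return "\n".join(lines).encode("utf-8")


def sizes_by_level(data: bytes) -> list:
    """Return (level, zlib_size, bz2_size) for each compression level 1..9."""
    rows = []
    for level in range(1, 10):
        zlib_size = len(zlib.compress(data, level))
        bz2_size = len(bz2.compress(data, compresslevel=level))
        rows.append((level, zlib_size, bz2_size))
    return rows


if __name__ == "__main__":
    data = synthetic_log()
    print(f"original size: {len(data)} bytes")
    print("level  zlib_bytes  bz2_bytes")
    for level, zlib_size, bz2_size in sizes_by_level(data):
        print(f"{level:>5}  {zlib_size:>10}  {bz2_size:>9}")
```

When no level is passed, `zlib.compress` uses level -1, which currently corresponds to level 6, so a sweep like this also shows the output size at a commonly used default next to the sizes at higher levels.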

Author information

Corresponding author

Correspondence to Kundi Yao.

Additional information

Communicated by: Paolo Tonella

Cite this article

Yao, K., Li, H., Shang, W. et al. A study of the performance of general compressors on log files. Empir Software Eng 25, 3043–3085 (2020). https://doi.org/10.1007/s10664-020-09822-x

  • DOI: https://doi.org/10.1007/s10664-020-09822-x
