An optimal text compression algorithm based on frequent pattern mining

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Data compression has been studied in depth over the years, producing Huffman encoding, LZ77, LZW, GZip, RAR and other schemes. Much of this research has focused on conventional character- or word-based mechanisms and has not considered the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of data mining suggested by Naren Ramakrishnan et al., wherein Huffman encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the association rule mining (ARM) technique. The paper proposes a novel FPM-based Huffman encoding algorithm for text data that uses a hash table to count frequent patterns. The proposed algorithm operates on a pruned set of frequent patterns and is efficient in terms of database scans and storage space because it reduces the code table size. The encoding process uses this optimal (pruned) set of patterns instead of the character-based approach of conventional Huffman encoding. Simulation results over 18 benchmark corpora show an improvement in compression ratio ranging from 18.49% on sparse datasets to 751% on dense datasets. The proposed algorithm also achieves a pattern space reduction ranging from 5% on sparse datasets to 502% on dense corpora.
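The general idea of pattern-based Huffman encoding can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: the abstract does not specify the pruning rule or the parsing strategy, so the `max_len` and `min_support` parameters, the greedy longest-match parse, and all function names below are assumptions made only for illustration. The sketch counts substrings in a hash table, keeps the frequent ones, tokenizes the text with them, and builds Huffman codes over the tokens actually used.

```python
import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len=3, min_support=2):
    """Count substrings of length 1..max_len in a hash table and keep the
    frequent ones. max_len and min_support are illustrative assumptions."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    # Single characters are always kept so that any text remains encodable.
    return {p for p, c in counts.items() if len(p) == 1 or c >= min_support}


def greedy_parse(text, patterns, max_len=3):
    """Tokenize with the longest matching pattern at each position
    (an assumed strategy, not taken from the paper)."""
    tokens, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + k]
            if piece in patterns:
                tokens.append(piece)
                i += k
                break
    return tokens


def huffman_codes(freqs):
    """Build a prefix code from a {symbol: weight} mapping."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def encode(text, max_len=3, min_support=2):
    patterns = frequent_patterns(text, max_len, min_support)
    tokens = greedy_parse(text, patterns, max_len)
    # The code table holds only the patterns that the parse actually used.
    codes = huffman_codes(Counter(tokens))
    return "".join(codes[t] for t in tokens), codes


def decode(bits, codes):
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Calling `encode("abracadabra abracadabra")` returns a bit string and a code table whose keys are multi-character patterns as well as single characters, and `decode` restores the original text from the two. Building the code table from the tokens used by the parse, rather than from every frequent pattern found, is one simple way to keep the table small. The pruning described in the paper may work differently.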

References

  • (1987) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Census+Income. Accessed 14 Jan 2017

  • (1989) Calgary and Canterbury compression corpus datasets. http://corpus.canterbury.ac.nz/descriptions/. Accessed 12 Jan 2017

  • (2003) Deflate compression algorithm. http://www.pkware.com. Accessed 12 Jan 2017

  • (2005) TREC genomics track data. http://skynet.ohsu.edu/trec-gen/data/2004/. Accessed 12 Jan 2017

  • (2006) RAR compression algorithm. http://www.rarlab.com/. Accessed 12 Jan 2017

  • (2008) PAQ. http://www.cs.fit.edu/~mmahoney/compression. Accessed 02 June 2017

  • (2013) De Bruijn sequence data. http://bioinf.spbau.ru/en/spadesmanual. Accessed 03 June 2017

  • (2013) Silesia dataset. http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia. Accessed 12 Jan 2017

  • (2014) Amazon reviews dataset. http://jmcauley.ucsd.edu/data/amazon/. Accessed 03 June 2017

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Bocca JB, Jarke M, Zaniolo C (eds) VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12–15, 1994. Morgan Kaufmann, Santiago de Chile, Chile, pp 487–499

  • Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000) Mining frequent patterns with counting inference. ACM SIGKDD Explor Newsl 2(2):66–75. doi:10.1145/380995.381017

  • Basu S, Chaturvedi S, Hegde RM (2016) Text compression using lexicographic permutation of binary strings. In: 2016 International Conference on Signal Processing and Communications (SPCOM), IEEE, pp 1–5

  • Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. Commun ACM 29(4):320–330

  • Bledsoe RE (1987) Data communication with modified Huffman coding

  • Borgelt C (2005) Keeping things simple: finding frequent item sets by recursive elimination. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 66–70

  • Borgelt C (2012) Frequent item set mining. Wiley Interdiscip Rev Data Min Knowl Discov 2(6):437–456

  • Brent RP (1987) A linear algorithm for data compression. Aust Comput J 19(2):64–68

  • Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Record 26:255–264

  • Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm

  • Salomon D (2004) Data compression: the complete reference, 2nd edn. http://www.ecs.csun.edu/~dxs/DC3advertis/Dcomp3Ad.html

  • Huffman DA (1952) A method for the construction of minimum redundancy codes. Proc IRE 40(9):1098–1101

  • Decaroli G, Gagie T, Manzini G (2017) A compact index for order-preserving pattern matching. In: Data Compression Conference (DCC), 2017, IEEE, pp 72–81

  • Deng ZH, Lv SL (2014) Fast mining frequent itemsets using nodesets. Expert Syst Appl 41(10):4505–4512

  • Deng ZH, Lv SL (2015) PrePost+: an efficient N-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning. Expert Syst Appl 42(13):5424–5432

  • Deorowicz S (2003) Universal lossless data compression algorithms. PhD thesis, Silesian University of Technology, Gliwice

  • Feigenblat G, Porat E, Shiftan A (2016) Linear time succinct indexable dictionary construction with applications. In: Data Compression Conference (DCC), 2016, IEEE, pp 13–22

  • Goethals B (2003) Survey on frequent pattern mining, manuscript

  • Golomb S (1966) Run-length encodings (corresp.). IEEE Trans Inf Theory 12(3):399–401. doi:10.1109/TIT.1966.1053907

  • Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann

  • Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87

  • Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86

  • Jakobsson M (1985) Compression of character strings by an adaptive dictionary. BIT Num Math 25(4):593–603

  • Kempa D, Kosolobov D (2017) LZ-End parsing in compressed space. In: Data Compression Conference (DCC), 2017, IEEE, pp 350–359

  • Köppl D, Sadakane K (2016) Lempel-Ziv computation in compressed space (LZ-CICS). In: Data Compression Conference (DCC), 2016, IEEE, pp 3–12

  • Labeit J, Shun J, Blelloch GE (2017) Parallel lightweight wavelet tree, suffix array and FM-index construction. J Disc Algorithms 43:2–17

  • Li W, Yao Y (2016) Accelerate data compression in file system. In: Data Compression Conference (DCC), 2016, IEEE, pp 615–615

  • Lin KC, Liao IE, Chang TP, Lin SF (2014) A frequent itemset mining algorithm based on the principle of inclusion–exclusion and transaction mapping. Inf Sci 276:278–289

  • Liu J, Pan Y, Wang K, Han J (2002) Mining frequent item sets by opportunistic projection. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 229–238

  • Miller VS, Wegman MN (1985) Variations on a theme by Ziv and Lempel. In: Combinatorial algorithms on words, Springer, pp 131–140

  • Moffat A (1990) Implementing the PPM data compression scheme. IEEE Trans Commun 38(11):1917–1921

  • Nelson MR (1989) LZW data compression. Dr Dobb’s J 14(10):29–36

  • Oswald C, Ghosh AI, Sivaselvan B (2015) An efficient text compression algorithm-data mining perspective. In: Mining Intelligence and Knowledge Exploration, Springer, pp 563–575

  • Oswald C, Srinidhi S, Vishnu KS, Vishal T, Sivaselvan B (2017) Hash based frequent pattern mining approach to text compression. EAI. doi:10.4108/eai.27-2-2017.152268

  • Park JS, Chen MS, Yu PS (1995) An effective hash-based algorithm for mining association rules, vol 24. ACM

  • Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine: Hyper-structure mining of frequent patterns in large databases. In: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, IEEE, pp 441–448

  • Policriti A, Prezza N (2016) Computing LZ77 in run-compressed space. In: Data Compression Conference (DCC), 2016, IEEE, pp 23–32

  • Pountain D (1987) Run-length encoding. Byte 12(6):317–319

  • Pratas D, Pinho AJ, Ferreira PJ (2016) Efficient compression of genomic sequences. In: Data Compression Conference (DCC), 2016, IEEE, pp 231–240

  • Ramabadran TV, Gaitonde SS (1988) A tutorial on CRC computations. IEEE Micro 4:62–75

  • Ramakrishnan N, Grama A (1999) Data mining: From serendipity to science—guest editors’ introduction. IEEE Comput 32(8):34–37

  • Rodeh M, Pratt VR, Even S (1981) Linear algorithm for data compression via string matching. J ACM (JACM) 28(1):16–24

  • Savasere A, Omiecinski ER, Navathe SB (1995) An efficient algorithm for mining association rules in large databases

  • Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 5(1):3–55

  • Storer JA (1988) Data compression: methods and theory. Computer Science Press, Inc

  • Storer JA, Szymanski TG (1982) Data compression via textual substitution. J ACM (JACM) 29(4):928–951

  • Toivonen H (1996) Sampling large databases for association rules. VLDB 96:134–145

  • Uno T, Kiyomi M, Arimura H (2005) LCM ver. 3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 77–86

  • Vitter JS (1987) Design and analysis of dynamic Huffman codes. J ACM (JACM) 34(4):825–845

  • Williams RN (1991) An extremely fast Ziv-Lempel data compression algorithm. In: Data Compression Conference, 1991. DCC’91., IEEE, pp 362–371

  • Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–540

  • Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390. http://dblp.uni-trier.de/db/journals/tkde/tkde12.html#Zaki00

  • Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 326–335

  • Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

  • Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536

Author information

Corresponding author

Correspondence to C. Oswald.

About this article

Cite this article

Oswald, C., Sivaselvan, B. An optimal text compression algorithm based on frequent pattern mining. J Ambient Intell Human Comput 9, 803–822 (2018). https://doi.org/10.1007/s12652-017-0540-2

  • DOI: https://doi.org/10.1007/s12652-017-0540-2
