Abstract
Data compression has been studied in depth over the years, resulting in techniques and tools such as Huffman Encoding, LZ77, LZW, GZip, and RAR. Much of this research has focused on conventional character- or word-based mechanisms, without considering the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of Data Mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel FPM-based Huffman Encoding algorithm for text data that employs a hash table for frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is efficient in terms of database scans and storage space by reducing the code table size. An optimal (pruned) set of patterns is used in the encoding process instead of the character-based approach of conventional Huffman Encoding. Simulation results over 18 benchmark corpora demonstrate improvements in compression ratio ranging from 18.49% on sparse datasets to 751% on dense datasets. The proposed algorithm also achieves a pattern space reduction ranging from 5% on sparse datasets to 502% on dense corpora.
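The general idea of encoding frequent patterns rather than single characters can be illustrated with a minimal sketch. This is not the authors' algorithm: the substring counting, the `min_support` threshold, the `max_len` bound, the greedy longest-match tokenizer, and all function names are assumptions made for illustration only, and the sketch ignores the cost of storing the code table.

```python
import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len, min_support):
    """Count substrings of length 1..max_len in a hash table and prune them.

    Single characters are always kept so that any text can be tokenized
    (an assumption of this sketch, not a claim about the paper).
    """
    table = Counter()  # hash table: pattern -> occurrence count
    for k in range(1, max_len + 1):
        for i in range(len(text) - k + 1):
            table[text[i:i + k]] += 1
    return {p for p, c in table.items() if len(p) == 1 or c >= min_support}


def tokenize(text, patterns, max_len):
    """Split text greedily, taking the longest known pattern at each position."""
    tokens, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + k] in patterns:
                tokens.append(text[i:i + k])
                i += k
                break
    return tokens


def huffman_codes(tokens):
    """Build a prefix-free code table (symbol -> bit string) over the tokens."""
    freq = Counter(tokens)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    """Concatenate the code of every token into one bit string."""
    return "".join(codes[t] for t in tokens)


def decode(bits, codes):
    """Invert the code table; valid because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)


text = "abracadabra abracadabra abracadabra"
patterns = frequent_patterns(text, max_len=4, min_support=3)
tokens = tokenize(text, patterns, max_len=4)
pattern_bits = encode(tokens, huffman_codes(tokens))
char_bits = encode(list(text), huffman_codes(list(text)))
print(len(pattern_bits), "bits with patterns vs", len(char_bits), "bits per character")
```

Greedy longest-match tokenization is chosen here only for brevity. A fair comparison with character-based Huffman would also have to count the bits needed to store each code table, which is where pruning the pattern set matters.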
References
(1987) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Census+Income. Accessed 14 Jan 2017
(1989) Calgary and canterbury compression corpus datasets. http://corpus.canterbury.ac.nz/descriptions/. Accessed 12 Jan 2017
(2003) Deflate compression algorithm. http://www.pkware.com. Accessed 12 Jan 2017
(2005) TREC genomics track data. http://skynet.ohsu.edu/trec-gen/data/2004/. Accessed 12 Jan 2017
(2006) Rar compression algorithm. http://www.rarlab.com/. Accessed 12 Jan 2017
(2008) PAQ. http://www.cs.fit.edu/~mmahoney/compression. Accessed 02 June 2017
(2013) Debruijn sequence data. http://bioinf.spbau.ru/en/spadesmanual. Accessed 03 June 2017
(2013) Silesia dataset. http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia. Accessed 12 Jan 2017
(2014) Amazon reviews dataset. http://jmcauley.ucsd.edu/data/amazon/. Accessed 03 June 2017
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Bocca JB, Jarke M, Zaniolo C (eds) VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12–15, 1994. Morgan Kaufmann, Santiago de Chile, Chile, pp 487–499
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000a) Mining frequent patterns with counting inference. SIGKDD Explor Newsl 2(2):66–75. doi:10.1145/380995.381017
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000b) Mining frequent patterns with counting inference. ACM SIGKDD Explor Newslett 2(2):66–75
Basu S, Chaturvedi S, Hegde RM (2016) Text compression using lexicographic permutation of binary strings. In: 2016 International Conference on Signal Processing and Communications (SPCOM), IEEE, pp 1–5
Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. Commun ACM 29(4):320–330
Bledsoe RE (1987) Data communication with modified Huffman coding
Borgelt C (2005) Keeping things simple: finding frequent item sets by recursive elimination. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 66–70
Borgelt C (2012) Frequent item set mining. Wiley Interdiscip Rev Data Min Knowl Discov 2(6):437–456
Brent RP (1987) A linear algorithm for data compression. Aust Comput J 19(2):64–68
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Record ACM 26:255–264
Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm
Salomon D (2004) Data compression: the complete reference, 2nd edn. http://www.ecs.csun.edu/~dxs/DC3advertis/Dcomp3Ad.html
Huffman DA (1952) A method for the construction of minimum redundancy codes. Proc IRE 40(9):1098–1101
Decaroli G, Gagie T, Manzini G (2017) A compact index for order-preserving pattern matching. In: Data Compression Conference (DCC), 2017, IEEE, pp 72–81
Deng ZH, Lv SL (2014) Fast mining frequent itemsets using nodesets. Expert Syst Appl 41(10):4505–4512
Deng ZH, Lv SL (2015) Prepost+: an efficient n-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning. Expert Syst Appl 42(13):5424–5432
Deorowicz S (2003) Universal lossless data compression algorithms. Philosophy Dissertation Thesis, Gliwice
Feigenblat G, Porat E, Shiftan A (2016) Linear time succinct indexable dictionary construction with applications. In: Data Compression Conference (DCC), 2016, IEEE, pp 13–22
Goethals B (2003) Survey on frequent pattern mining, manuscript
Golomb S (2006) Run-length encodings (corresp.). IEEE Trans Inf Theor 12(3):399–401. doi:10.1109/TIT.1966.1053907
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann
Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Jakobsson M (1985) Compression of character strings by an adaptive dictionary. BIT Num Math 25(4):593–603
Kempa D, Kosolobov D (2017) Lz-end parsing in compressed space. In: Data Compression Conference (DCC), 2017, IEEE, pp 350–359
Köppl D, Sadakane K (2016) Lempel-ziv computation in compressed space (lz-cics). In: Data Compression Conference (DCC), 2016, IEEE, pp 3–12
Labeit J, Shun J, Blelloch GE (2017) Parallel lightweight wavelet tree, suffix array and fm-index construction. J Disc Algorithms 43:2–17
Li W, Yao Y (2016) Accelerate data compression in file system. In: Data Compression Conference (DCC), 2016, IEEE, pp 615–615
Lin KC, Liao IE, Chang TP, Lin SF (2014) A frequent itemset mining algorithm based on the principle of inclusion–exclusion and transaction mapping. Inf Sci 276:278–289
Liu J, Pan Y, Wang K, Han J (2002) Mining frequent item sets by opportunistic projection. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 229–238
Miller VS, Wegman MN (1985) Variations on a theme by ziv and lempel. In: Combinatorial algorithms on words, Springer, pp 131–140
Moffat A (1990) Implementing the ppm data compression scheme. IEEE Trans Commun 38(11):1917–1921
Nelson MR (1989) Lzw data compression. Dr Dobb’s J 14(10):29–36
Oswald C, Ghosh AI, Sivaselvan B (2015) An efficient text compression algorithm-data mining perspective. In: Mining Intelligence and Knowledge Exploration, Springer, pp 563–575
Oswald C, Srinidhi S, Vishnu KS, Vishal T, Sivaselvan B (2017) Hash based frequent pattern mining approach to text compression. EAI. doi:10.4108/eai.27-2-2017.152268
Park JS, Chen MS, Yu PS (1995) An effective hash-based algorithm for mining association rules, vol 24. ACM
Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine: Hyper-structure mining of frequent patterns in large databases. In: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, IEEE, pp 441–448
Policriti A, Prezza N (2016) Computing lz77 in run-compressed space. In: Data Compression Conference (DCC), 2016, IEEE, pp 23–32
Pountain D (1987) Run-length encoding. Byte 12(6):317–319
Pratas D, Pinho AJ, Ferreira PJ (2016) Efficient compression of genomic sequences. In: Data Compression Conference (DCC), 2016, IEEE, pp 231–240
Ramabadran TV, Gaitonde SS (1988) A tutorial on crc computations. IEEE Micro 4:62–75
Ramakrishnan N, Grama A (1999) Data mining: From serendipity to science—guest editors’ introduction. IEEE Comput 32(8):34–37
Rodeh M, Pratt VR, Even S (1981) Linear algorithm for data compression via string matching. J ACM (JACM) 28(1):16–24
Savasere A, Omiecinski ER, Navathe SB (1995) An efficient algorithm for mining association rules in large databases
Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 5(1):3–55
Storer JA (1988) Data compression: methods and theory. Computer Science Press, Inc
Storer JA, Szymanski TG (1982) Data compression via textual substitution. J ACM (JACM) 29(4):928–951
Toivonen H (1996) Sampling large databases for association rules. VLDB 96:134–145
Uno T, Kiyomi M, Arimura H (2005) Lcm ver. 3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 77–86
Vitter JS (1987) Design and analysis of dynamic huffman codes. J ACM (JACM) 34(4):825–845
Williams RN (1991) An extremely fast ziv-lempel data compression algorithm. In: Data Compression Conference, 1991. DCC’91., IEEE, pp 362–371
Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–540
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390. http://dblp.uni-trier.de/db/journals/tkde/tkde12.html#Zaki00
Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 326–335
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536
Cite this article
Oswald, C., Sivaselvan, B. An optimal text compression algorithm based on frequent pattern mining. J Ambient Intell Human Comput 9, 803–822 (2018). https://doi.org/10.1007/s12652-017-0540-2
DOI: https://doi.org/10.1007/s12652-017-0540-2