An optimal text compression algorithm based on frequent pattern mining

Oswald, C.; Sivaselvan, B.

doi:10.1007/s12652-017-0540-2

Original Research
Published: 15 July 2017

Volume 9, pages 803–822, (2018)

Journal of Ambient Intelligence and Humanized Computing

772 Accesses
17 Citations

Abstract

Data compression has been explored in depth over the years, resulting in Huffman Encoding, LZ77, LZW, GZip, RAR, etc. Much of this research has focused on conventional character- or word-based mechanisms, without considering the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of data mining suggested by Naren Ramakrishnan et al., wherein Huffman Encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel FPM-based Huffman Encoding algorithm for text data that employs a hash table for frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is efficient in terms of database scans and storage space because it reduces the code table size. An optimal (pruned) set of patterns is used in the encoding process instead of the character-based approach of conventional Huffman. Simulation results over 18 benchmark corpora demonstrate an improvement in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. The proposed algorithm also achieves pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora.
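
To make the idea in the abstract concrete, the following is a minimal Python sketch of pattern-based Huffman coding. It is an illustration under stated assumptions, not the authors' algorithm: it treats contiguous substrings of bounded length as "patterns", counts them in a hash table (a `Counter`), prunes by a minimum support, and splits the text greedily on the longest frequent pattern. The function names and the parameters `max_len` and `min_support` are hypothetical and chosen only for this example.

```python
import heapq
from collections import Counter
from itertools import count


def frequent_patterns(text, max_len, min_support):
    """Count every substring of length 2..max_len in a hash table and keep
    those occurring at least min_support times (assumed pruning rule)."""
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {p for p, c in counts.items() if c >= min_support}


def tokenize(text, patterns, max_len):
    """Greedy left-to-right split: take the longest frequent pattern that
    starts at the current position, otherwise a single character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in patterns:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def huffman_codes(tokens):
    """Build a prefix-free code table from token frequencies."""
    freq = Counter(tokens)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {t: ""}) for t, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    return "".join(codes[t] for t in tokens)


def decode(bits, codes):
    """Invert the code table and read the bit string prefix by prefix."""
    inverse = {c: t for t, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


text = "abracadabra abracadabra abracadabra"
patterns = frequent_patterns(text, max_len=4, min_support=3)
tokens = tokenize(text, patterns, max_len=4)
codes = huffman_codes(tokens)
bits = encode(tokens, codes)

# Baseline: conventional character-based Huffman on the same text.
char_bits = encode(list(text), huffman_codes(list(text)))
print(len(bits), len(char_bits))
```

On this toy input the pattern-based bit string is shorter than the character-based one. The comparison ignores the cost of storing the code table, which is the cost the abstract says the proposed algorithm reduces by pruning the pattern set.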

References

(1987) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Census+Income. Accessed 14 Jan 2017
(1989) Calgary and canterbury compression corpus datasets. http://corpus.canterbury.ac.nz/descriptions/. Accessed 12 Jan 2017
(2003) Deflate compression algorithm. http://www.pkware.com. Accessed 12 Jan 2017
(2005) TREC genomics track data. http://skynet.ohsu.edu/trec-gen/data/2004/. Accessed 12 Jan 2017
(2006) Rar compression algorithm. http://www.rarlab.com/. Accessed 12 Jan 2017
(2008) PAQ. http://www.cs.fit.edu/~mmahoney/compression. Accessed 02 June 2017
(2013) Debruijn sequence data. http://bioinf.spbau.ru/en/spadesmanual. Accessed 03 June 2017
(2013) Silesia dataset. http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia. Accessed 12 Jan 2017
(2014) Amazon reviews dataset. http://jmcauley.ucsd.edu/data/amazon/. Accessed 03 June 2017
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Bocca JB, Jarke M, Zaniolo C (eds) VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12–15, 1994. Morgan Kaufmann, Santiago de Chile, Chile, pp 487–499
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000a) Mining frequent patterns with counting inference. SIGKDD Explor Newsl 2(2):66–75. doi:10.1145/380995.381017
Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000b) Mining frequent patterns with counting inference. ACM SIGKDD Explor Newslett 2(2):66–75
Basu S, Chaturvedi S, Hegde RM (2016) Text compression using lexicographic permutation of binary strings. In: 2016 International Conference on Signal Processing and Communications (SPCOM), IEEE, pp 1–5
Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. Commun ACM 29(4):320–330
Bledsoe RE (1987) Data communication with modified Huffman coding
Borgelt C (2005) Keeping things simple: finding frequent item sets by recursive elimination. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 66–70
Borgelt C (2012) Frequent item set mining. Wiley Interdiscip Rev Data Min Knowl Discov 2(6):437–456
Brent RP (1987) A linear algorithm for data compression. Aust Comput J 19(2):64–68
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Record ACM 26:255–264
Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm
Salomon D (2004) Data compression: the complete reference, 2nd edn. http://www.ecs.csun.edu/~dxs/DC3advertis/Dcomp3Ad.html
Huffman DA (1952) A method for the construction of minimum redundancy codes. Proc IRE 40(9):1098–1101
Decaroli G, Gagie T, Manzini G (2017) A compact index for order-preserving pattern matching. In: Data Compression Conference (DCC), 2017, IEEE, pp 72–81
Deng ZH, Lv SL (2014) Fast mining frequent itemsets using nodesets. Expert Syst Appl 41(10):4505–4512
Deng ZH, Lv SL (2015) Prepost+: an efficient n-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning. Expert Syst Appl 42(13):5424–5432
Deorowicz S (2003) Universal lossless data compression algorithms. Philosophy Dissertation Thesis, Gliwice
Feigenblat G, Porat E, Shiftan A (2016) Linear time succinct indexable dictionary construction with applications. In: Data Compression Conference (DCC), 2016, IEEE, pp 13–22
Goethals B (2003) Survey on frequent pattern mining, manuscript
Golomb S (1966) Run-length encodings (corresp.). IEEE Trans Inf Theory 12(3):399–401. doi:10.1109/TIT.1966.1053907
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann
Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Jakobsson M (1985) Compression of character strings by an adaptive dictionary. BIT Num Math 25(4):593–603
Kempa D, Kosolobov D (2017) Lz-end parsing in compressed space. In: Data Compression Conference (DCC), 2017, IEEE, pp 350–359
Köppl D, Sadakane K (2016) Lempel-ziv computation in compressed space (lz-cics). In: Data Compression Conference (DCC), 2016, IEEE, pp 3–12
Labeit J, Shun J, Blelloch GE (2017) Parallel lightweight wavelet tree, suffix array and fm-index construction. J Disc Algorithms 43:2–17
Li W, Yao Y (2016) Accelerate data compression in file system. In: Data Compression Conference (DCC), 2016, IEEE, pp 615–615
Lin KC, Liao IE, Chang TP, Lin SF (2014) A frequent itemset mining algorithm based on the principle of inclusion–exclusion and transaction mapping. Inf Sci 276:278–289
Liu J, Pan Y, Wang K, Han J (2002) Mining frequent item sets by opportunistic projection. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 229–238
Miller VS, Wegman MN (1985) Variations on a theme by ziv and lempel. In: Combinatorial algorithms on words, Springer, pp 131–140
Moffat A (1990) Implementing the ppm data compression scheme. IEEE Trans Commun 38(11):1917–1921
Nelson MR (1989) Lzw data compression. Dr Dobb’s J 14(10):29–36
Oswald C, Ghosh AI, Sivaselvan B (2015) An efficient text compression algorithm-data mining perspective. In: Mining Intelligence and Knowledge Exploration, Springer, pp 563–575
Oswald C, Srinidhi S, Vishnu KS, Vishal T, Sivaselvan B (2017) Hash based frequent pattern mining approach to text compression. EAI. doi:10.4108/eai.27-2-2017.152268
Park JS, Chen MS, Yu PS (1995) An effective hash-based algorithm for mining association rules, vol 24. ACM
Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine: Hyper-structure mining of frequent patterns in large databases. In: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, IEEE, pp 441–448
Policriti A, Prezza N (2016) Computing lz77 in run-compressed space. In: Data Compression Conference (DCC), 2016, IEEE, pp 23–32
Pountain D (1987) Run-length encoding. Byte 12(6):317–319
Pratas D, Pinho AJ, Ferreira PJ (2016) Efficient compression of genomic sequences. In: Data Compression Conference (DCC), 2016, IEEE, pp 231–240
Ramabadran TV, Gaitonde SS (1988) A tutorial on crc computations. IEEE Micro 4:62–75
Ramakrishnan N, Grama A (1999) Data mining: From serendipity to science—guest editors’ introduction. IEEE Comput 32(8):34–37
Rodeh M, Pratt VR, Even S (1981) Linear algorithm for data compression via string matching. J ACM (JACM) 28(1):16–24
Savasere A, Omiecinski ER, Navathe SB (1995) An efficient algorithm for mining association rules in large databases
Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 5(1):3–55
Storer JA (1988) Data compression: methods and theory. Computer Science Press, Inc
Storer JA, Szymanski TG (1982) Data compression via textual substitution. J ACM (JACM) 29(4):928–951
Toivonen H (1996) Sampling large databases for association rules. VLDB 96:134–145
Uno T, Kiyomi M, Arimura H (2005) Lcm ver. 3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 77–86
Vitter JS (1987) Design and analysis of dynamic huffman codes. J ACM (JACM) 34(4):825–845
Williams RN (1991) An extremely fast ziv-lempel data compression algorithm. In: Data Compression Conference, 1991. DCC’91., IEEE, pp 362–371
Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–540
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390. http://dblp.uni-trier.de/db/journals/tkde/tkde12.html#Zaki00
Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 326–335
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536

Author information

Authors and Affiliations

Department of Computer Engineering, Indian Institute of Information Technology, Design and Manufacturing Kancheepuram, Chennai, India
C. Oswald & B. Sivaselvan

Corresponding author

Correspondence to C. Oswald.

About this article

Cite this article

Oswald, C., Sivaselvan, B. An optimal text compression algorithm based on frequent pattern mining. J Ambient Intell Human Comput 9, 803–822 (2018). https://doi.org/10.1007/s12652-017-0540-2

Received: 18 January 2017
Accepted: 28 June 2017
Published: 15 July 2017
Issue Date: June 2018
DOI: https://doi.org/10.1007/s12652-017-0540-2
