research-article

Homomorphic Compression: Making Text Processing on Compression Unlimited

Published: 12 December 2023

Abstract

Lossless data compression is an effective way to reduce the large transmission and storage overhead of massive text data, and its utility grows as data volumes surge. Operating on compressed data opens new opportunities for efficient text management: it allows text processing tasks, mainly access-oriented ones, to run directly on compressed data without decompression. Existing compressed text processing schemes, however, support only limited types of operations, have low efficiency, and occupy considerable space. We address these problems by proposing a homomorphic compression theory, which enables the generalization and characterization of algorithms with compression processing capabilities. On this basis, we develop HOCO, an efficient text data management engine that supports a variety of processing tasks on compressed text. We select three representative compression schemes and implement them, combined with homomorphism, in HOCO. HOCO supports the extension of homomorphic compression schemes through a modular, object-oriented design and provides convenient interfaces for text processing tasks. We evaluate HOCO on six real-world datasets. The three schemes implemented in HOCO show trade-offs in compression ratio, supported operation types, and efficiency. Experiments also show that HOCO achieves higher throughput in random access and modification operations (9.18× on average relative to the state of the art) and lower latency in text analytic tasks (7.16× on average relative to processing uncompressed text) without compromising compression efficacy.
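To make the idea of processing text directly in compressed form concrete, the following is a minimal sketch using run-length encoding. It is an illustration only: the abstract does not name the three schemes implemented in HOCO, and the function names here (`rle_compress`, `rle_extract`, `rle_concat`) are hypothetical and are not HOCO's interfaces. The sketch shows one access operation (extracting a substring) and one modification operation (concatenation) that work on the compressed form, and the property that compressing a concatenation gives the same result as concatenating the compressed pieces.

```python
from bisect import bisect_right
from itertools import groupby


def rle_compress(text):
    """Compress text into a list of (char, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def rle_decompress(runs):
    """Expand (char, run_length) pairs back into the original text."""
    return "".join(ch * n for ch, n in runs)


def rle_extract(runs, offset, length):
    """Return text[offset:offset+length] by walking runs only.

    Assumes offset >= 0. Only the runs overlapping the requested
    range are expanded; the rest of the text is never materialized.
    """
    # Prefix sums give the start position of each run in the original text.
    starts = [0]
    for _, n in runs:
        starts.append(starts[-1] + n)
    end = min(offset + length, starts[-1])
    if offset >= end:
        return ""
    i = bisect_right(starts, offset) - 1  # index of the run containing offset
    out = []
    pos = offset
    while pos < end:
        ch, _ = runs[i]
        take = min(starts[i + 1], end) - pos
        out.append(ch * take)
        pos += take
        i += 1
    return "".join(out)


def rle_concat(a, b):
    """Concatenate two compressed texts, merging the boundary runs if equal."""
    if a and b and a[-1][0] == b[0][0]:
        merged = (a[-1][0], a[-1][1] + b[0][1])
        return a[:-1] + [merged] + b[1:]
    return a + b


if __name__ == "__main__":
    x, y = "aab", "bbc"
    # Operating on the compressed form agrees with operating on plain text.
    assert rle_concat(rle_compress(x), rle_compress(y)) == rle_compress(x + y)
    assert rle_extract(rle_compress("aaabbbbcc"), 2, 4) == "aaabbbbcc"[2:6]
```

The prefix sums are rebuilt on every call to keep the sketch short; a real engine would presumably maintain an index over the compressed data so that random access avoids a linear pass.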

Cited By

  • (2024) TDSQL: Tencent Distributed Database System. Proceedings of the VLDB Endowment 17:12 (3869-3882). DOI: 10.14778/3685800.3685812. Online publication date: 8-Nov-2024

Published In

Proceedings of the ACM on Management of Data  Volume 1, Issue 4
PACMMOD
December 2023
1317 pages
EISSN:2836-6573
DOI:10.1145/3637468
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compression
  2. homomorphism
  3. operating on compressed data

Qualifiers

  • Research-article

Funding Sources

  • Beijing Nova Program
  • National Natural Science Foundation of China

Article Metrics

  • Downloads (Last 12 months): 427
  • Downloads (Last 6 weeks): 30
Reflects downloads up to 20 Jan 2025
