Efficient Partitioning Method for Optimizing the Compression on Array Data

Han, Shuai; Liu, Xian-Min; Li, Jian-Zhong

doi:10.1007/s11390-022-2371-7

Efficient Partitioning Method for Optimizing the Compression on Array Data

Regular Paper
Published: 30 September 2022

Volume 37, pages 1049–1067, (2022)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Shuai Han¹,
Xian-Min Liu¹ &
Jian-Zhong Li¹

216 Accesses
2 Citations
Explore all metrics

Abstract

Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that array partitioning can affect the compression performance significantly, this paper aims to design an efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Duggan J, Stonebraker M. Incremental elasticity for array databases. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.409-420. https://doi.org/10.1145/2588555.2588569.
Li J, Rotem D, Wong H K. A new compression method with fast searching on large databases. In Proc. the 13th International Conference on Very Large Data Bases, Sept. 1987, pp.311-318.
Wang J, Lin C, Papakonstantinou Y, Swanson S. An experimental study of bitmap compression vs. inverted list compression. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.993-1008. https://doi.org/10.1145/3035918.3064007.
Damme P, Ungethüm A, Hildebrandt J, Habich D, Lehner W. From a comprehensive experimental survey to a cost-based selection strategy for lightweight integer compression algorithms. ACM Transactions on Database Systems, 2019, 44(3): Article No. 9. https://doi.org/10.1145/3323991.
Sarawagi S, Stonebraker M. Efficient organization of large multidimensional arrays. In Proc. the 1994 International Conference on Data Engineering, Feb. 1994, pp.328-336. https://doi.org/10.1109/ICDE.1994.283048.
Otoo E J, Rotem D, Seshadri S. Optimal chunking of large multidimensional arrays for data warehousing. In Proc. the 10th International Workshop on Data Warehousing and OLAP, Nov. 2007, pp.25-32. https://doi.org/10.1145/1317331.1317337.
Stonebraker M, Brown P, Poliakov A, Raman S. The architecture of SciDB. In Proc. the 23rd International Conference on Scientific and Statistical Database Management, Jul. 2011, pp.1-16. https://doi.org/10.1007/978-3-642-22351-8_1.
Soroush E, Balazinska M, Wang D. ArrayStore: A storage manager for complex parallel array processing. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp.253-264. https://doi.org/10.1145/1989323.1989351.
Rusu F, Cheng Y. A survey on array storage, query languages, and systems. arXiv:1302.0103, 2013. https://arxiv.org/abs/1302.0103, Feb. 2022.
Chang C, Moon B, Acharya A, Shock C, Sussman A, Saltz J. Titan: A high-performance remote-sensing database. In Proc. the 13th International Conference on Data Engineering, April 1997, pp.375-384. https://doi.org/10.1109/ICDE.1997.581883.
Marathe A P, Salem K. Query processing techniques for arrays. The VLDB Journal, 2002, 11(1): 68-91. https://doi.org/10.1007/s007780200062.
Article Google Scholar
Brown P G. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2010, pp.963-968. https://doi.org/10.1145/1807167.1807271.
Papadopoulos S, Datta K, Madden S, Mattson T. The TileDB array data storage manager. Proceedings of the VLDB Endowment, 2016, 10(4): 349-360. https://doi.org/10.14778/3025111.3025117.
Article Google Scholar
Kaddoura M, Ranka S, Wang A. Array decompositions for nonuniform computational environments. Journal of Parallel and Distributed Computing, 1996, 36(2): 91-105. https://doi.org/10.1006/jpdc.1996.0092.
Article Google Scholar
Muthukrishnan S, Suel T. Approximation algorithms for array partitioning problems. Journal of Algorithms, 2005, 54(1): 85-104. https://doi.org/10.1016/j.jalgor.2003.11.006.
Article MathSciNet MATH Google Scholar
Moudani W, Hussein M, Moukhtar M, Mora-Camino F. Intelligent data compression approach in multidimensional data warehouse. International Journal on Computer Science and Engineering, 2011, 3(2): 951-960.
Google Scholar
Rahmani M M, Al-Mahmud A, Hossen M A, Rahman M, Ahmed M R, Sohan M F. A comparative analysis of traditional and modern data compression schemes for large multi-dimensional extendible array. In Proc. the 2019 International Conference on Electrical, Computer and Communication Engineering, Feb. 2019. https://doi.org/10.1109/ECACE.2019.8679182.
Cormode G, Garofalakis M, Haas P J et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends® in Databases, 2011, 4(1/2/3): 1-294. https://doi.org/10.1561/1900000004.
Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11(1): 37-57. https://doi.org/10.1145/3147.3165.
Article MathSciNet MATH Google Scholar
Deshpande A, Garofalakis M, Rastogi R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. the 2001 ACM SIGMOD International Conference on Management of Data, May 2001, pp.199-210. https://doi.org/10.1145/375663.375685.
Reiner B, Hahn K, Höfling G, Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proc. the 13th International Conference on Database and Expert Systems Applications, Sept. 2002, pp.689-700. https://doi.org/10.1007/3-540-46146-9_68.
Chai C, Li G, Li J, Deng D, Feng J. Cost-effective crowdsourced entity resolution: A partial-order approach. In Proc. the 2016 International Conference on Management of Data, Jun. 26-Jul. 1, 2016, pp.969-984. https://doi.org/10.1145/2882903.2915252.
Chai C, Liu J, Tang N, Li G, Luo Y. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment, 2022, 15(7): 1466-1478. https://doi.org/10.14778/3523210.3523223.
Article Google Scholar
Liu J, Chai C, Luo Y, Lou Y, Feng J, Tang N. Feature augmentation with reinforcement learning. In Proc. the 2022 IEEE International Conference on Data Engineering, May 2022, pp.3360-3372. https://doi.org/10.1109/ICDE53745.2022.00317.
Manne F, Sørevik T. Partitioning an array onto a mesh of processors. In Proc. the 3rd International Workshop on Applied Parallel Computing, Aug. 1996, pp.467-477. https://doi.org/10.1007/3-540-62095-8_50.
Mingozzi A, Ricciardelli S, Spadoni M. Partitioning a matrix to minimize the maximum cost. Discrete Applied Mathematics, 1995, 62(1/2/3): 221-248. https://doi.org/10.1016/0166-218X(94)00154-6.
Article MathSciNet MATH Google Scholar
Nicol D M. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 1994, 23(2): 119-134. https://doi.org/10.1006/jpdc.1994.1126.
Article Google Scholar
Saule E, Baş E Ö, Çatalyürek Ü V. Load-balancing spatially located computations using rectangular partitions. Journal of Parallel and Distributed Computing, 2012, 72(10): 1201-1214. https://doi.org/10.1016/j.jpdc.2012.05.013.
Article MATH Google Scholar
Roth M A, Van Horn S J. Database compression. SIGMOD Record, 1993, 22(3): 31-39. https://doi.org/10.1145/163090.163096.
Article Google Scholar
Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software–Practice and Experience, 2015, 45(1): 1-29. https://doi.org/10.1002/spe.2203.
Schlegel B, Gemulla R, Lehner W. Fast integer compression using SIMD instructions. In Proc. the 6th International Workshop on Data Management on New Hardware, June 2010, pp.34-40. https://doi.org/10.1145/1869389.1869394.
Antoshenkov G. Byte-aligned bitmap compression. In Proc. the 1995 Data Compression Conference, Mar. 1995, pp.476. https://doi.org/10.1109/DCC.1995.515586.
Wu K, Otoo E J, Shoshani A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 2006, 31(1): 1-38. https://doi.org/10.1145/1132863.1132864.
Article Google Scholar
Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 2010, 69(1): 3-28. https://doi.org/10.1016/j.datak.2009.08.006.
Article Google Scholar
Colantonio A, Di Pietro R. CONCISE: Compressed ‘n’ composable integer set. Information Processing Letters, 2010, 110(16): 644-650. https://doi.org/10.1016/j.ipl.2010.05.018.
Article MathSciNet MATH Google Scholar
Chambi S, Lemire D, Kaser O, Godin R. Better bitmap performance with roaring bitmaps. Software–Practice and Experience, 2016, 46(5): 709-719. https://doi.org/10.1002/spe.2325.
Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB: Optimizing queries with crowd-based selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.1463-1478. https://doi.org/10.1145/3035918.3064036.
Chai C, Cao L, Li G, Li J, Luo Y, Madden S. Human-in-the-loop outlier detection. In Proc. the 2020 International Conference on Management of Data, Jun. 2020, pp.19-33. https://doi.org/10.1145/3318464.3389772.
Huffman D A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952, 40(9): 1098-1101. https://doi.org/10.1109/JRPROC.1952.273898.
Article MATH Google Scholar
Witten I H, Neal R M, Cleary J G. Arithmetic coding for data compression. Communications of the ACM, 1987, 30(6): 520-540. https://doi.org/10.1145/214762.214771.
Article Google Scholar
Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343. https://doi.org/10.1109/TIT.1977.1055714.
Article MathSciNet MATH Google Scholar
Agarwal R, Khandelwal A, Stoica I. Succinct: Enabling queries on compressed data. In Proc. the 12th USENIX Symposium on Networked Systems Design and Implementation, May 2015, pp.337-350.
Mutlu O, Shen X, Zhai J, Zhang F. Potential of a method for text analytics directly on compressed data. Technical Report, North Carolina State University, 2017. https://repository.lib.ncsu.edu/bitstream/handle/1840.20/35582/TR-2017-4.pdf?sequence=1&isAllowed=y, Feb. 2022.
Zhang F, Zhai J, Shen X, Mutlu O, Chen W. Zwift: A programming framework for high performance text analytics on compressed data. In Proc. the 32nd International Conference on Supercomputing, Jun. 2018, pp.195-206. https://doi.org/10.1145/3205289.3205325.
Nevill-Manning C G, Witten I H. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 1997, 7: 67-82. https://doi.org/10.1613/jair.374.
Article MATH Google Scholar
Zhang F, Pan Z, Zhou Y, Zhai J, Shen X, Mutlu O, Du X. GTADOC: Enabling efficient GPU-based text analytics without decompression. In Proc. the 2021 IEEE International Conference on Data Engineering, Apr. 2021, pp.1679-1690. https://doi.org/10.1109/ICDE51399.2021.00148.
Zhang F, Zhai J, Shen X, Mutlu O, Du X. POCLib: A high-performance framework for enabling near orthogonal processing on compression. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(2): 459-475. https://doi.org/10.1109/TPDS.2021.3093234.
Article Google Scholar

Download references

Author information

Authors and Affiliations

Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
Shuai Han, Xian-Min Liu & Jian-Zhong Li

Authors

Shuai Han
View author publications
You can also search for this author in PubMed Google Scholar
Xian-Min Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jian-Zhong Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shuai Han.

Supplementary Information

ESM 1

(PDF 131 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Han, S., Liu, XM. & Li, JZ. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 1049–1067 (2022). https://doi.org/10.1007/s11390-022-2371-7

Download citation

Received: 31 March 2022
Accepted: 21 September 2022
Published: 30 September 2022
Issue Date: October 2022
DOI: https://doi.org/10.1007/s11390-022-2371-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient Partitioning Method for Optimizing the Compression on Array Data

Abstract

Access this article

Similar content being viewed by others

Abstraction of Arrays Based on Non Contiguous Partitions

Sliced AETG: a memory-efficient variant of the AETG covering array generation algorithm

A scalable array storage for efficient maintenance of future data

References

Author information

Authors and Affiliations

Corresponding author

Supplementary Information

ESM 1

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Efficient Partitioning Method for Optimizing the Compression on Array Data

Abstract

Access this article

Similar content being viewed by others

Abstraction of Arrays Based on Non Contiguous Partitions

Sliced AETG: a memory-efficient variant of the AETG covering array generation algorithm

A scalable array storage for efficient maintenance of future data

References

Author information

Authors and Affiliations

Corresponding author

Supplementary Information

ESM 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation