Skip to main content
Log in

Efficient Partitioning Method for Optimizing the Compression on Array Data

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that array partitioning can affect the compression performance significantly, this paper aims to design an efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Duggan J, Stonebraker M. Incremental elasticity for array databases. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.409-420. https://doi.org/10.1145/2588555.2588569.

  2. Li J, Rotem D, Wong H K. A new compression method with fast searching on large databases. In Proc. the 13th International Conference on Very Large Data Bases, Sept. 1987, pp.311-318.

  3. Wang J, Lin C, Papakonstantinou Y, Swanson S. An experimental study of bitmap compression vs. inverted list compression. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.993-1008. https://doi.org/10.1145/3035918.3064007.

  4. Damme P, Ungethüm A, Hildebrandt J, Habich D, Lehner W. From a comprehensive experimental survey to a cost-based selection strategy for lightweight integer compression algorithms. ACM Transactions on Database Systems, 2019, 44(3): Article No. 9. https://doi.org/10.1145/3323991.

  5. Sarawagi S, Stonebraker M. Efficient organization of large multidimensional arrays. In Proc. the 1994 International Conference on Data Engineering, Feb. 1994, pp.328-336. https://doi.org/10.1109/ICDE.1994.283048.

  6. Otoo E J, Rotem D, Seshadri S. Optimal chunking of large multidimensional arrays for data warehousing. In Proc. the 10th International Workshop on Data Warehousing and OLAP, Nov. 2007, pp.25-32. https://doi.org/10.1145/1317331.1317337.

  7. Stonebraker M, Brown P, Poliakov A, Raman S. The architecture of SciDB. In Proc. the 23rd International Conference on Scientific and Statistical Database Management, Jul. 2011, pp.1-16. https://doi.org/10.1007/978-3-642-22351-8_1.

  8. Soroush E, Balazinska M, Wang D. ArrayStore: A storage manager for complex parallel array processing. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp.253-264. https://doi.org/10.1145/1989323.1989351.

  9. Rusu F, Cheng Y. A survey on array storage, query languages, and systems. arXiv:1302.0103, 2013. https://arxiv.org/abs/1302.0103, Feb. 2022.

  10. Chang C, Moon B, Acharya A, Shock C, Sussman A, Saltz J. Titan: A high-performance remote-sensing database. In Proc. the 13th International Conference on Data Engineering, April 1997, pp.375-384. https://doi.org/10.1109/ICDE.1997.581883.

  11. Marathe A P, Salem K. Query processing techniques for arrays. The VLDB Journal, 2002, 11(1): 68-91. https://doi.org/10.1007/s007780200062.

    Article  Google Scholar 

  12. Brown P G. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2010, pp.963-968. https://doi.org/10.1145/1807167.1807271.

  13. Papadopoulos S, Datta K, Madden S, Mattson T. The TileDB array data storage manager. Proceedings of the VLDB Endowment, 2016, 10(4): 349-360. https://doi.org/10.14778/3025111.3025117.

    Article  Google Scholar 

  14. Kaddoura M, Ranka S, Wang A. Array decompositions for nonuniform computational environments. Journal of Parallel and Distributed Computing, 1996, 36(2): 91-105. https://doi.org/10.1006/jpdc.1996.0092.

    Article  Google Scholar 

  15. Muthukrishnan S, Suel T. Approximation algorithms for array partitioning problems. Journal of Algorithms, 2005, 54(1): 85-104. https://doi.org/10.1016/j.jalgor.2003.11.006.

    Article  MathSciNet  MATH  Google Scholar 

  16. Moudani W, Hussein M, Moukhtar M, Mora-Camino F. Intelligent data compression approach in multidimensional data warehouse. International Journal on Computer Science and Engineering, 2011, 3(2): 951-960.

    Google Scholar 

  17. Rahmani M M, Al-Mahmud A, Hossen M A, Rahman M, Ahmed M R, Sohan M F. A comparative analysis of traditional and modern data compression schemes for large multi-dimensional extendible array. In Proc. the 2019 International Conference on Electrical, Computer and Communication Engineering, Feb. 2019. https://doi.org/10.1109/ECACE.2019.8679182.

  18. Cormode G, Garofalakis M, Haas P J et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends® in Databases, 2011, 4(1/2/3): 1-294. https://doi.org/10.1561/1900000004.

  19. Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11(1): 37-57. https://doi.org/10.1145/3147.3165.

    Article  MathSciNet  MATH  Google Scholar 

  20. Deshpande A, Garofalakis M, Rastogi R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. the 2001 ACM SIGMOD International Conference on Management of Data, May 2001, pp.199-210. https://doi.org/10.1145/375663.375685.

  21. Reiner B, Hahn K, Höfling G, Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proc. the 13th International Conference on Database and Expert Systems Applications, Sept. 2002, pp.689-700. https://doi.org/10.1007/3-540-46146-9_68.

  22. Chai C, Li G, Li J, Deng D, Feng J. Cost-effective crowdsourced entity resolution: A partial-order approach. In Proc. the 2016 International Conference on Management of Data, Jun. 26-Jul. 1, 2016, pp.969-984. https://doi.org/10.1145/2882903.2915252.

  23. Chai C, Liu J, Tang N, Li G, Luo Y. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment, 2022, 15(7): 1466-1478. https://doi.org/10.14778/3523210.3523223.

    Article  Google Scholar 

  24. Liu J, Chai C, Luo Y, Lou Y, Feng J, Tang N. Feature augmentation with reinforcement learning. In Proc. the 2022 IEEE International Conference on Data Engineering, May 2022, pp.3360-3372. https://doi.org/10.1109/ICDE53745.2022.00317.

  25. Manne F, Sørevik T. Partitioning an array onto a mesh of processors. In Proc. the 3rd International Workshop on Applied Parallel Computing, Aug. 1996, pp.467-477. https://doi.org/10.1007/3-540-62095-8_50.

  26. Mingozzi A, Ricciardelli S, Spadoni M. Partitioning a matrix to minimize the maximum cost. Discrete Applied Mathematics, 1995, 62(1/2/3): 221-248. https://doi.org/10.1016/0166-218X(94)00154-6.

    Article  MathSciNet  MATH  Google Scholar 

  27. Nicol D M. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 1994, 23(2): 119-134. https://doi.org/10.1006/jpdc.1994.1126.

    Article  Google Scholar 

  28. Saule E, Baş E Ö, Çatalyürek Ü V. Load-balancing spatially located computations using rectangular partitions. Journal of Parallel and Distributed Computing, 2012, 72(10): 1201-1214. https://doi.org/10.1016/j.jpdc.2012.05.013.

    Article  MATH  Google Scholar 

  29. Roth M A, Van Horn S J. Database compression. SIGMOD Record, 1993, 22(3): 31-39. https://doi.org/10.1145/163090.163096.

    Article  Google Scholar 

  30. Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software–Practice and Experience, 2015, 45(1): 1-29. https://doi.org/10.1002/spe.2203.

  31. Schlegel B, Gemulla R, Lehner W. Fast integer compression using SIMD instructions. In Proc. the 6th International Workshop on Data Management on New Hardware, June 2010, pp.34-40. https://doi.org/10.1145/1869389.1869394.

  32. Antoshenkov G. Byte-aligned bitmap compression. In Proc. the 1995 Data Compression Conference, Mar. 1995, pp.476. https://doi.org/10.1109/DCC.1995.515586.

  33. Wu K, Otoo E J, Shoshani A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 2006, 31(1): 1-38. https://doi.org/10.1145/1132863.1132864.

    Article  Google Scholar 

  34. Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 2010, 69(1): 3-28. https://doi.org/10.1016/j.datak.2009.08.006.

    Article  Google Scholar 

  35. Colantonio A, Di Pietro R. CONCISE: Compressed ‘n’ composable integer set. Information Processing Letters, 2010, 110(16): 644-650. https://doi.org/10.1016/j.ipl.2010.05.018.

    Article  MathSciNet  MATH  Google Scholar 

  36. Chambi S, Lemire D, Kaser O, Godin R. Better bitmap performance with roaring bitmaps. Software–Practice and Experience, 2016, 46(5): 709-719. https://doi.org/10.1002/spe.2325.

  37. Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB: Optimizing queries with crowd-based selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.1463-1478. https://doi.org/10.1145/3035918.3064036.

  38. Chai C, Cao L, Li G, Li J, Luo Y, Madden S. Human-in-the-loop outlier detection. In Proc. the 2020 International Conference on Management of Data, Jun. 2020, pp.19-33. https://doi.org/10.1145/3318464.3389772.

  39. Huffman D A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952, 40(9): 1098-1101. https://doi.org/10.1109/JRPROC.1952.273898.

    Article  MATH  Google Scholar 

  40. Witten I H, Neal R M, Cleary J G. Arithmetic coding for data compression. Communications of the ACM, 1987, 30(6): 520-540. https://doi.org/10.1145/214762.214771.

    Article  Google Scholar 

  41. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343. https://doi.org/10.1109/TIT.1977.1055714.

    Article  MathSciNet  MATH  Google Scholar 

  42. Agarwal R, Khandelwal A, Stoica I. Succinct: Enabling queries on compressed data. In Proc. the 12th USENIX Symposium on Networked Systems Design and Implementation, May 2015, pp.337-350.

  43. Mutlu O, Shen X, Zhai J, Zhang F. Potential of a method for text analytics directly on compressed data. Technical Report, North Carolina State University, 2017. https://repository.lib.ncsu.edu/bitstream/handle/1840.20/35582/TR-2017-4.pdf?sequence=1&isAllowed=y, Feb. 2022.

  44. Zhang F, Zhai J, Shen X, Mutlu O, Chen W. Zwift: A programming framework for high performance text analytics on compressed data. In Proc. the 32nd International Conference on Supercomputing, Jun. 2018, pp.195-206. https://doi.org/10.1145/3205289.3205325.

  45. Nevill-Manning C G, Witten I H. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 1997, 7: 67-82. https://doi.org/10.1613/jair.374.

    Article  MATH  Google Scholar 

  46. Zhang F, Pan Z, Zhou Y, Zhai J, Shen X, Mutlu O, Du X. GTADOC: Enabling efficient GPU-based text analytics without decompression. In Proc. the 2021 IEEE International Conference on Data Engineering, Apr. 2021, pp.1679-1690. https://doi.org/10.1109/ICDE51399.2021.00148.

  47. Zhang F, Zhai J, Shen X, Mutlu O, Du X. POCLib: A high-performance framework for enabling near orthogonal processing on compression. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(2): 459-475. https://doi.org/10.1109/TPDS.2021.3093234.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shuai Han.

Supplementary Information

ESM 1

(PDF 131 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Han, S., Liu, XM. & Li, JZ. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 1049–1067 (2022). https://doi.org/10.1007/s11390-022-2371-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-022-2371-7

Keywords

Navigation