ABSTRACT
Managing massive data sets plays an increasingly important role in data analytics because data access is a major bottleneck. Data skipping is a promising technique for reducing the number of data accesses: it partitions data into pages and accesses only those pages that contain data to be retrieved by a query. Effective data partitioning is therefore required to minimize the number of page accesses. However, obtaining an optimal data partitioning for a given query pattern and data distribution is an NP-hard problem.
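As a rough illustration of why partitioning matters for data skipping, the following minimal sketch counts how many pages a one-dimensional range query must access under two partitionings of the same values. The function names, the page size, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def sorted_pages(values: List[int], page_size: int) -> List[List[int]]:
    """Cluster values: sort them and cut the sorted run into fixed-size pages."""
    ordered = sorted(values)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def round_robin_pages(values: List[int], num_pages: int) -> List[List[int]]:
    """Scatter values: deal them out to pages in round-robin order."""
    return [values[i::num_pages] for i in range(num_pages)]


def pages_accessed(pages: List[List[int]], lo: int, hi: int) -> int:
    """Count pages holding at least one value in the query range [lo, hi]."""
    return sum(1 for page in pages if any(lo <= v <= hi for v in page))


# Hypothetical toy data: 100 distinct values, 10 values per page.
values = list(range(100))
clustered = sorted_pages(values, page_size=10)
scattered = round_robin_pages(values, num_pages=10)

print(pages_accessed(clustered, 25, 34))  # 2 pages
print(pages_accessed(scattered, 25, 34))  # 10 pages
```

Both layouts hold the same 100 values in 10 pages, yet the clustered layout lets the query skip 8 of them while the scattered layout skips none.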
We propose a framework built on a multidimensional indexing technique based on a space-filling curve. A space-filling curve determines which portions of the data are stored in the same page. The problem can therefore be cast as selecting a curve that lays out the data accessed by a query so as to minimize the number of page accesses. To solve this problem, we analyzed how different space-filling curves affect the number of page accesses, and found that it is critical for a curve to fit the query pattern and to be robust against any data distribution. We propose a cost model that measures how well a space-filling curve fits a given query pattern and tolerates data skew. We also propose a method for designing a query-aware and skew-tolerant curve for a given query pattern.
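To make the idea of a curve "fitting" a query pattern concrete, here is a minimal sketch, not the paper's cost model or curve-design method. It assumes a family of curves obtained by merging the bits of two coordinates in different orders (the pattern `"xyxyxy"` is the familiar Z-order), a uniform 8×8 grid with one point per cell, and pages of 8 points. The names `curve_key` and `pages_touched` are hypothetical, and data skew is not modeled.

```python
from typing import Tuple


def curve_key(x: int, y: int, pattern: str, bits: int) -> int:
    """Map (x, y) to a 1-D key by consuming the bits of x and y, most
    significant first, in the order given by pattern (e.g. "xyxyxy")."""
    nx = ny = bits
    key = 0
    for dim in pattern:
        if dim == "x":
            nx -= 1
            bit = (x >> nx) & 1
        else:
            ny -= 1
            bit = (y >> ny) & 1
        key = (key << 1) | bit
    return key


def pages_touched(pattern: str, query: Tuple[int, int, int, int],
                  bits: int = 3, page_size: int = 8) -> int:
    """Order all grid cells along the curve, cut the order into pages, and
    count pages holding a cell inside query = (x_lo, x_hi, y_lo, y_hi)."""
    side = 1 << bits
    cells = sorted(((x, y) for x in range(side) for y in range(side)),
                   key=lambda c: curve_key(c[0], c[1], pattern, bits))
    pages = [cells[i:i + page_size] for i in range(0, len(cells), page_size)]
    x_lo, x_hi, y_lo, y_hi = query
    return sum(
        1 for page in pages
        if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in page)
    )


# Hypothetical query pattern: one fixed x value, every y value.
column_query = (2, 2, 0, 7)
for pattern in ("xxxyyy", "xyxyxy", "yyyxxx"):
    print(pattern, pages_touched(pattern, column_query))
```

For this column-shaped query the x-major curve touches 1 page, Z-order touches 2, and the y-major curve touches all 8, so the best curve depends on the shape of the queries it must serve.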
We prototyped our framework using the defined query-aware and skew-tolerant curve. In experiments on a skewed data set, we confirmed that our framework can reduce the number of page accesses by an order of magnitude for data warehousing (DWH) and geographic information systems (GIS) applications with real-world data.
Index Terms
- QUILTS: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves