Abstract
Histograms are widely used to estimate selectivity in query optimization. In this paper, we propose a new histogram construction method for geographic data objects, which are used in many real-world applications. The proposed method analyzes and exploits the clusters of objects that exist in a given data set to build histograms with significantly enhanced accuracy. Our philosophy is to allocate histogram buckets to the subspaces that properly capture object clusters. Accordingly, we first propose a procedure to find the centers of object clusters. We then propose an algorithm that constructs the histogram buckets from these centers: the buckets are initialized at the clusters’ centers and then expanded to cover the clusters. The best expansion plans are chosen based on a notion of skewness gain. Results from extensive experiments on real-life data sets demonstrate that the proposed method further improves histogram accuracy compared with the current state-of-the-art histogram construction method for geographic data objects.
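The idea of growing a bucket outward from a cluster center and then using it for selectivity estimation can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the abstract does not define skewness gain, so the sketch substitutes a simple density-drop stopping rule, and the names `Bucket`, `grow_bucket`, `estimate`, `step`, and `ratio` are all hypothetical. The estimator uses the standard assumption that points are uniformly distributed inside each bucket.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Axis-aligned rectangle [x1, x2] x [y1, y2] holding a point count."""
    x1: float
    y1: float
    x2: float
    y2: float
    count: int = 0

    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


def count_in(points, x1, y1, x2, y2):
    """Number of points inside the closed rectangle [x1, x2] x [y1, y2]."""
    return sum(1 for x, y in points if x1 <= x <= x2 and y1 <= y <= y2)


def grow_bucket(points, center, step, ratio=0.5, max_steps=50):
    """Grow a square bucket around `center` in increments of `step` (> 0).

    ASSUMPTION: the stopping rule is a stand-in for the paper's skewness
    gain. Expansion stops once the density of the newly added ring falls
    below `ratio` times the density of the current bucket.
    """
    cx, cy = center
    half = step
    inner = count_in(points, cx - half, cy - half, cx + half, cy + half)
    for _ in range(max_steps):
        new_half = half + step
        total = count_in(points, cx - new_half, cy - new_half,
                         cx + new_half, cy + new_half)
        ring_area = (2 * new_half) ** 2 - (2 * half) ** 2
        ring_density = (total - inner) / ring_area
        inner_density = inner / (2 * half) ** 2
        if ring_density < ratio * inner_density:
            break  # the ring is much sparser than the cluster: stop growing
        half, inner = new_half, total
    return Bucket(cx - half, cy - half, cx + half, cy + half, inner)


def estimate(buckets, qx1, qy1, qx2, qy2):
    """Estimate how many points fall in the query rectangle.

    Each bucket contributes its count scaled by the fraction of its area
    that overlaps the query (uniformity assumption within a bucket).
    """
    est = 0.0
    for b in buckets:
        w = min(b.x2, qx2) - max(b.x1, qx1)
        h = min(b.y2, qy2) - max(b.y1, qy1)
        if w > 0 and h > 0:
            est += b.count * (w * h) / b.area()
    return est
```

A caller would pass a list of `(x, y)` tuples and one center per cluster, collect the resulting buckets in a list, and hand that list to `estimate` together with the query rectangle. The sketch grows each bucket symmetrically for brevity; a method that chooses among several expansion plans, as the abstract describes, would instead evaluate each side or direction separately.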
















Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0020415). The authors also thank anonymous reviewers for valuable comments to improve this work.
Cite this article
Mai, H.T., Kim, J., Roh, Y.J. et al. STHist-C: a highly accurate cluster-based histogram for two and three dimensional geographic data points. Geoinformatica 17, 325–352 (2013). https://doi.org/10.1007/s10707-012-0154-y
DOI: https://doi.org/10.1007/s10707-012-0154-y