STHist-C: a highly accurate cluster-based histogram for two and three dimensional geographic data points

GeoInformatica

Abstract

Histograms are widely used to estimate selectivity in query optimization. In this paper, we propose a new histogram construction method for geographic data objects, which are used in many real-world applications. The method analyzes and exploits the clusters of objects that exist in a given data set to build histograms with significantly enhanced accuracy. Our philosophy is to allocate histogram buckets to the subspaces that properly capture object clusters. We therefore first propose a procedure to find the centers of object clusters, and then an algorithm to construct the histogram buckets from these centers. The buckets are initialized at the cluster centers and then expanded to cover the clusters; the best expansion plans are chosen based on a notion of skewness gain. Results from extensive experiments on real-life data sets demonstrate that the proposed method further improves histogram accuracy compared with the current state-of-the-art histogram construction method for geographic data objects.
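To make the bucket idea concrete, the following is a minimal Python sketch, not the paper's algorithm. It grows one rectangular bucket outward from a given cluster center and then estimates the result size of a range query under the usual assumption that points are spread uniformly inside each bucket. The names `Bucket`, `grow_bucket`, and `estimate` are hypothetical, and the expansion score used here (number of newly covered points) is only a stand-in for the skewness gain, which the abstract does not define.

```python
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class Bucket:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    count: int = 0  # number of data points covered by the bucket

    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)

    def contains(self, p: Point) -> bool:
        return self.xmin <= p[0] <= self.xmax and self.ymin <= p[1] <= self.ymax


def grow_bucket(points: list[Point], center: Point, step: float, max_steps: int) -> Bucket:
    """Greedily expand a box around `center`, one side per iteration.

    Assumption: the score is the number of newly covered points,
    a placeholder for the paper's skewness gain.
    """
    cx, cy = center
    box = Bucket(cx - step, cy - step, cx + step, cy + step)
    box.count = sum(box.contains(p) for p in points)
    for _ in range(max_steps):
        # Four candidate expansion plans: push out one side by `step`.
        candidates = [
            Bucket(box.xmin - step, box.ymin, box.xmax, box.ymax),
            Bucket(box.xmin, box.ymin - step, box.xmax, box.ymax),
            Bucket(box.xmin, box.ymin, box.xmax + step, box.ymax),
            Bucket(box.xmin, box.ymin, box.xmax, box.ymax + step),
        ]
        for c in candidates:
            c.count = sum(c.contains(p) for p in points)
        best = max(candidates, key=lambda c: c.count)
        if best.count == box.count:  # no plan covers any new point
            break
        box = best
    return box


def estimate(buckets: list[Bucket], q: tuple[float, float, float, float]) -> float:
    """Estimate how many points fall in q = (xmin, ymin, xmax, ymax),
    assuming points are uniformly distributed inside each bucket."""
    total = 0.0
    for b in buckets:
        w = min(b.xmax, q[2]) - max(b.xmin, q[0])
        h = min(b.ymax, q[3]) - max(b.ymin, q[1])
        if w > 0 and h > 0 and b.area() > 0:
            total += b.count * (w * h) / b.area()
    return total
```

A query that overlaps a quarter of a bucket's area is credited with a quarter of that bucket's count, so buckets that tightly enclose a cluster keep the uniformity assumption closer to the truth than buckets that also span empty space.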

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0020415). The authors also thank anonymous reviewers for valuable comments to improve this work.

Author information

Corresponding author

Correspondence to Hai Thanh Mai.

About this article

Cite this article

Mai, H.T., Kim, J., Roh, Y.J. et al. STHist-C: a highly accurate cluster-based histogram for two and three dimensional geographic data points. Geoinformatica 17, 325–352 (2013). https://doi.org/10.1007/s10707-012-0154-y

  • DOI: https://doi.org/10.1007/s10707-012-0154-y
