STHist-C: a highly accurate cluster-based histogram for two and three dimensional geographic data points

Mai, Hai Thanh; Kim, Jaeho; Roh, Yohan J.; Kim, Myoung Ho

doi:10.1007/s10707-012-0154-y

Published: 10 February 2012

GeoInformatica, volume 17, pages 325–352 (2013)

Abstract

Histograms are widely used to estimate selectivity in query optimization. In this paper, we propose a new histogram construction method for geographic data objects, which are used in many real-world applications. The method analyzes and exploits the clusters of objects present in a given data set to build histograms with significantly higher accuracy. Our philosophy is to allocate histogram buckets to the subspaces that properly capture object clusters. We therefore first propose a procedure to find the centers of object clusters, and then an algorithm to construct the histogram buckets from these centers. The buckets are initialized at the clusters' centers and then expanded to cover the clusters; the best expansion plans are chosen based on a notion of skewness gain. Results from extensive experiments on real-life data sets demonstrate that the proposed method improves histogram accuracy over the current state-of-the-art histogram construction method for geographic data objects.
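
The "initialize at a center, then expand" idea can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract defines neither the center-finding procedure nor the skewness gain, so the code below makes its own assumptions. The data are pre-aggregated into a grid of cell counts, the densest cell stands in for a cluster center, and the gain of an expansion is taken to be the reduction in a simple two-region error (sum of squared deviations inside the bucket plus the same outside it). The names `sse`, `model_error`, and `grow_bucket` are hypothetical.

```python
import numpy as np

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty array)."""
    return float(((values - values.mean()) ** 2).sum()) if values.size else 0.0

def model_error(counts, box):
    """Error of a two-region model: uniform inside the box, uniform outside.

    Assumption: this is a stand-in measure, not the paper's skewness.
    """
    r0, r1, c0, c1 = box
    inside = np.zeros(counts.shape, dtype=bool)
    inside[r0:r1, c0:c1] = True
    return sse(counts[inside]) + sse(counts[~inside])

def grow_bucket(counts):
    """Seed a one-cell box at the densest cell, then expand it greedily.

    The box is half-open: rows [r0, r1), columns [c0, c1). Each step tries to
    push one side out by one cell and keeps the move with the largest error
    reduction (the stand-in "gain"); it stops when no move helps.
    """
    rows, cols = counts.shape
    r, c = np.unravel_index(np.argmax(counts), counts.shape)
    box = (int(r), int(r) + 1, int(c), int(c) + 1)
    while True:
        r0, r1, c0, c1 = box
        candidates = []
        if r0 > 0:
            candidates.append((r0 - 1, r1, c0, c1))  # extend upward
        if r1 < rows:
            candidates.append((r0, r1 + 1, c0, c1))  # extend downward
        if c0 > 0:
            candidates.append((r0, r1, c0 - 1, c1))  # extend left
        if c1 < cols:
            candidates.append((r0, r1, c0, c1 + 1))  # extend right
        if not candidates:
            return box
        best = min(candidates, key=lambda b: model_error(counts, b))
        gain = model_error(counts, box) - model_error(counts, best)
        if gain <= 1e-9:  # no expansion improves the model
            return box
        box = best

# Toy grid: sparse background with one dense 3x3 block of cells.
counts = np.ones((6, 6))
counts[1:4, 2:5] = 10.0
print(grow_bucket(counts))
```

On this toy grid the bucket grows from the seed cell until it covers the dense block and then stops, because any further expansion would mix sparse cells into the bucket. A real construction would handle several clusters, a bucket budget, and three-dimensional data, none of which this sketch attempts.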

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0020415). The authors also thank the anonymous reviewers for their valuable comments, which helped improve this work.

Author information

Authors and Affiliations

Department of Computer Science, KAIST, 373-1 Guseong-Dong, Yuseong-Gu, Daejeon, 305-701, South Korea
Hai Thanh Mai, Jaeho Kim & Myoung Ho Kim
Samsung Advanced Institute of Technology, Samsung Electronics, Nongseo-dong, Giheung-gu, Yongin Si, Gyeonggi-Do, 446-712, South Korea
Yohan J. Roh

Corresponding author

Correspondence to Hai Thanh Mai.

About this article

Cite this article

Mai, H.T., Kim, J., Roh, Y.J. et al. STHist-C: a highly accurate cluster-based histogram for two and three dimensional geographic data points. Geoinformatica 17, 325–352 (2013). https://doi.org/10.1007/s10707-012-0154-y

Received: 21 July 2011
Revised: 20 November 2011
Accepted: 23 January 2012
Published: 10 February 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10707-012-0154-y
