Abstract
Many data management problems would benefit from the ability to draw a few samples from a data set and use them to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in data management.
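To make the problem concrete, the following minimal sketch shows why the naive guess (the largest value seen in the sample) tends to fall short of the true maximum. It is not the Bayesian method of the paper. The synthetic Gaussian data, the sample size of 100, and the helper name `sample_max` are all assumptions chosen for illustration.

```python
import random


def sample_max(data, n, rng):
    """Largest value in a simple random sample of size n (without replacement)."""
    return max(rng.sample(data, n))


# Hypothetical data set: 100,000 draws from a standard normal distribution.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
true_max = max(data)

# Repeat the naive estimate many times to see how it behaves on average.
estimates = [sample_max(data, 100, rng) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)

print(f"true max: {true_max:.2f}, mean sample max: {mean_estimate:.2f}")
```

The sample maximum can never exceed the true maximum, so averaging it over repeated samples gives a value that sits below the target. Closing that gap requires some model of the unseen tail of the data, which is the kind of inference a model-based approach such as the one described here is meant to provide.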
Cite this article
Wu, M., Jermaine, C. Guessing the extreme values in a data set: a Bayesian method and its applications. The VLDB Journal 18, 571–597 (2009). https://doi.org/10.1007/s00778-009-0133-6
DOI: https://doi.org/10.1007/s00778-009-0133-6