Guessing the extreme values in a data set: a Bayesian method and its applications

Wu, Mingxi; Jermaine, Chris

doi:10.1007/s00778-009-0133-6

Special Issue Paper
Published: 12 February 2009

The VLDB Journal, Volume 18, pages 571–597 (2009)

Mingxi Wu¹ &
Chris Jermaine¹

Abstract

For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, Top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.
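
To make the problem setting concrete, the following minimal sketch illustrates why the largest value in a small sample cannot simply be reported as the largest value in the data set: the sample maximum never exceeds the true maximum and usually falls short of it. This is not the Bayesian method of the paper. The synthetic log-normal data, the function name `sample_max_ratios`, and all parameter values are assumptions chosen purely for illustration.

```python
import random


def sample_max_ratios(data, sample_size, trials, seed=0):
    """Draw `trials` random samples (without replacement) from `data` and
    return the true maximum plus the ratio sample_max / true_max per trial.

    Hypothetical helper for illustration only; assumes all values are positive.
    """
    rng = random.Random(seed)
    true_max = max(data)
    ratios = []
    for _ in range(trials):
        sample = rng.sample(data, sample_size)
        ratios.append(max(sample) / true_max)
    return true_max, ratios


# Synthetic, skewed data set (an assumption; any positive-valued data would do).
data_rng = random.Random(42)
data = [data_rng.lognormvariate(0.0, 1.0) for _ in range(100_000)]

true_max, ratios = sample_max_ratios(data, sample_size=100, trials=200)

print(f"true maximum: {true_max:.2f}")
print(f"mean of sample_max / true_max: {sum(ratios) / len(ratios):.3f}")
print(f"best trial: {max(ratios):.3f}")
```

Every ratio is at most 1, so the sample maximum is a lower bound that is biased downward. Correcting for that gap in a statistically principled way is the problem the abstract describes.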

Author information

Authors and Affiliations

Computer and Information Science and Engineering Department, University of Florida, Gainesville, FL, 32611, USA
Mingxi Wu & Chris Jermaine

Corresponding author

Correspondence to Mingxi Wu.

About this article

Cite this article

Wu, M., Jermaine, C. Guessing the extreme values in a data set: a Bayesian method and its applications. The VLDB Journal 18, 571–597 (2009). https://doi.org/10.1007/s00778-009-0133-6

Received: 19 March 2008
Revised: 01 December 2008
Accepted: 10 December 2008
Published: 12 February 2009
Issue Date: April 2009
DOI: https://doi.org/10.1007/s00778-009-0133-6
