Guessing the extreme values in a data set: a Bayesian method and its applications

  • Special Issue Paper
The VLDB Journal

Abstract

Many data management problems would benefit from the ability to draw a few samples from a data set and use them to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance joins are just a few possible applications. This paper details a statistically rigorous Bayesian approach to this problem. Just as importantly, we demonstrate the utility of the approach by showing how it can be applied to four specific problems that arise in data management.

Author information

Corresponding author

Correspondence to Mingxi Wu.

About this article

Cite this article

Wu, M., Jermaine, C. Guessing the extreme values in a data set: a Bayesian method and its applications. The VLDB Journal 18, 571–597 (2009). https://doi.org/10.1007/s00778-009-0133-6

  • DOI: https://doi.org/10.1007/s00778-009-0133-6