Multiple Hypothesis Testing in Pattern Discovery

  • Conference paper
Discovery Science (DS 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6926)

Abstract

The problem of multiple hypothesis testing arises when more than one hypothesis is tested simultaneously for statistical significance. This situation is very common in data mining applications. For instance, simultaneously assessing the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework so that it can be used in a generic data mining setting. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive). We show the power of our solution on real data.
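To make the notion of FWER control concrete, the following is a minimal sketch of Holm's step-down procedure, a standard textbook method that keeps the FWER at or below a chosen level alpha. It is not the method proposed in this paper, which the abstract does not describe in detail. The function name `holm_reject` and the p-values are hypothetical and serve only as an illustration.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis, whether Holm's step-down procedure rejects it."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values are retained
    return reject


# Hypothetical p-values for four itemsets (illustrative numbers only).
p = [0.001, 0.04, 0.03, 0.012]
print(holm_reject(p, alpha=0.05))  # [True, False, False, True]
```

With four hypotheses, the thresholds are 0.0125, 0.0167, 0.025 and 0.05 for the sorted p-values. The third smallest p-value (0.03) exceeds 0.025, so only the two smallest are declared significant, even though all four fall below the unadjusted level of 0.05.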

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hanhijärvi, S. (2011). Multiple Hypothesis Testing in Pattern Discovery. In: Elomaa, T., Hollmén, J., Mannila, H. (eds) Discovery Science. DS 2011. Lecture Notes in Computer Science, vol 6926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24477-3_12

  • DOI: https://doi.org/10.1007/978-3-642-24477-3_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24476-6

  • Online ISBN: 978-3-642-24477-3

  • eBook Packages: Computer Science (R0)
