Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups

Transactions on Large-Scale Data- and Knowledge-Centered Systems LI

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 13410)

Abstract

We tackle the question of checking hypotheses on user data. In particular, we address the challenges that arise when testing an input hypothesis on many data samples, in our case, user groups. This setting is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of returning results that are not representative of the input (data coverage). We develop GroupTest, a framework for group testing that addresses both challenges. We formulate ValMin and CovMax, two generic top-n problems that seek n user groups satisfying one-sample, two-sample, or multiple-sample tests. ValMin optimizes significance subject to a constraint on data coverage, whereas CovMax maximizes data coverage while controlling significance. We show the hardness of ValMin and CovMax. We develop a greedy algorithm to solve the former problem and two algorithms to solve the latter: a greedy algorithm with a provable approximation guarantee and a heuristic based on \(\alpha\)-investing. Our extensive experiments on real-world datasets demonstrate the necessity of optimizing coverage for sound discoveries on large datasets, and the efficiency of our algorithms.
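The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's GroupTest implementation. The function names, the toy groups, the fixed threshold `alpha`, and the spending rule are all assumptions made for illustration. The first function is the textbook greedy heuristic for maximum coverage, restricted to groups whose precomputed p-value passes a threshold. The second follows the wealth update of \(\alpha\)-investing as described by Foster and Stine (gain a payout on rejection, lose \(\alpha_j / (1 - \alpha_j)\) otherwise), with a hypothetical rule that risks half of the current wealth at each test.

```python
from typing import Dict, List, Set, Tuple


def greedy_coverage(
    groups: Dict[str, Tuple[Set[int], float]],
    n: int,
    alpha: float,
) -> Tuple[List[str], Set[int]]:
    """Pick up to n groups with p-value <= alpha, greedily maximizing covered users.

    `groups` maps a (hypothetical) group name to (user ids, precomputed p-value).
    """
    # Keep only groups whose p-value passes the significance threshold.
    candidates = {g: users for g, (users, p) in groups.items() if p <= alpha}
    chosen: List[str] = []
    covered: Set[int] = set()
    while candidates and len(chosen) < n:
        # Take the group adding the most uncovered users; ties go to the smallest name.
        best = max(sorted(candidates), key=lambda g: len(candidates[g] - covered))
        if not candidates[best] - covered:
            break  # no remaining group adds coverage
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered


def alpha_investing(
    p_values: List[float],
    initial_wealth: float = 0.05,
    payout: float = 0.05,
) -> List[int]:
    """Return indices of rejected hypotheses under a simple alpha-investing policy.

    Assumed spending rule: alpha_j = W / (2 + W), so a failed test costs
    alpha_j / (1 - alpha_j) = W / 2, i.e. half of the current wealth W.
    """
    wealth = initial_wealth
    rejected: List[int] = []
    for j, p in enumerate(p_values):
        alpha_j = wealth / (2.0 + wealth)
        if p <= alpha_j:
            rejected.append(j)
            wealth += payout  # a discovery earns additional alpha-wealth
        else:
            wealth -= alpha_j / (1.0 - alpha_j)  # a failed test spends wealth
    return rejected


# Toy data (invented for illustration): group name -> (user ids, p-value).
toy_groups = {
    "young": ({1, 2, 3, 4}, 0.01),
    "urban": ({3, 4, 5}, 0.03),
    "students": ({5, 6}, 0.20),
    "retired": ({6, 7}, 0.04),
}
selected, covered_users = greedy_coverage(toy_groups, n=2, alpha=0.05)
print(selected, sorted(covered_users))
print(alpha_investing([0.001, 0.30, 0.02, 0.0004]))
```

In the toy run, "students" is filtered out by the threshold, and "retired" is preferred over "urban" in the second round because it adds two uncovered users rather than one. The greedy rule trades optimality for speed, which is the usual motivation for it on coverage problems.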

Notes

  1. https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021/

  2. https://github.com/statistical-group-testing/statistically-soundgrouping

Author information

Corresponding author

Correspondence to Nassim Bouarour.

Copyright information

© 2022 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Bouarour, N., Benouaret, I., Amer-Yahia, S. (2022). Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science, vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_3

  • DOI: https://doi.org/10.1007/978-3-662-66111-6_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-66110-9

  • Online ISBN: 978-3-662-66111-6

  • eBook Packages: Computer Science (R0)
