Abstract
We tackle the question of checking hypotheses on user data. In particular, we address the challenges that arise in the context of testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of not being representative of the input (data coverage). We develop GroupTest, a framework for group testing that addresses both challenges. We formulate ValMin and CovMax, two generic top-n problems that seek n user groups satisfying one-sample, two-sample, or multiple-sample tests. ValMin optimizes significance while setting a constraint on data coverage and CovMax aims to maximize data coverage while controlling significance. We show the hardness of ValMin and CovMax. We develop a greedy algorithm to solve the former problem and two algorithms to solve the latter where the first one is a greedy algorithm with a provable approximation guarantee and the second one is a heuristic-based algorithm based on \(\alpha \)-investing. Our extensive experiments on real-world datasets demonstrate the necessity to optimize coverage for sound discoveries on large datasets, and the efficiency of our algorithms.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Ageev, A.A., Sviridenko, M.I.: Approximation algorithms for maximum coverage and max cut with given sizes of parts. In: Cornuéjols, G., Burkard, R.E., Woeginger, G.J. (eds.) IPCO 1999. LNCS, vol. 1610, pp. 17–30. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48777-8_2
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM (1998)
Amer-Yahia, S., Kleisarchaki, S., Kolloju, N.K., Lakshmanan, L.V., Zamar, R.H.: Exploring rated datasets with rating maps. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1411–1419. International World Wide Web Conferences Steering Committee (2017)
Beliakov, G., James, S., Mordelová, J., Rückschlossová, T., Yager, R.R.: Generalized bonferroni mean operators in multi-criteria aggregation. Fuzzy Sets Syst. 161(17), 2227–2242 (2010)
Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001)
Boley, M., Mampaey, M., Kang, B., Tokmakov, P., Wrobel, S.: One click mining: interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics, pp. 27–35. ACM (2013)
Bron, C., Kerbosch, J.: Algorithm 457: finding all cliques of an undirected graph. Commun. ACM 16(9), 575–577 (1973)
Chekuri, C., Quanrud, K., Zhang, Z.: On approximating partial set cover and generalizations. arXiv preprint arXiv:1907.04413 (2019)
Colquhoun, D.: An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1(3), 140216 (2014)
Di Leo, G., Sardanelli, F.: Statistical significance: p value, 0.05 threshold, and applications to radiomics-reasons for a conservative approach. Eur. Radiol. Exp. 4(1), 1–8 (2020)
Foster, D., Stine, R.A.: Alpha-investing: a procedure for sequential control of expected false discoveries. J. Roy. Stat. Soc. Ser. B Stat. Methodol. 70(2), 429–444 (2008)
Goyal, A., Bonchi, F., Lakshmanan, L.V.: Discovering leaders from community actions. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 499–508. ACM (2008)
Greenland, S., et al.: Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016). https://doi.org/10.1007/s10654-016-0149-3
Hämäläinen, W., Webb, G.I.: A tutorial on statistically sound pattern discovery. Data Min. Knowl. Disc. 33(2), 325–377 (2018). https://doi.org/10.1007/s10618-018-0590-x
Hochbaum, D.S., Pathria, A.: Analysis of the greedy approach in problems of maximum k-coverage. Nav. Res. Logist. (NRL) 45(6), 615–627 (1998)
Jafari, M., Ansari-Pour, N.: Why, when and how to adjust your p values? Cell J. (Yakhteh) 20(4), 604 (2019)
Jiang, D., et al.: Cohort query processing. Proce. VLDB Endow. 10(1), 1–12 (2016)
Kamat, N., Jayachandran, P., Tunga, K., Nandi, A.: Distributed and interactive cube exploration. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 472–483. IEEE (2014)
Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9
Meijer, R.J., Goeman, J.J.: Multiple testing of gene sets from gene ontology: possibilities and pitfalls. Briefings Bioinform. 17(5), 808–818 (2016)
Mieth, B., et al.: Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies. Sci. Rep. 6(1), 1–14 (2016)
Newman, M.E.J.: Detecting community structure in networks. Eur. Phys. J. B 38(2), 321–330 (2004). https://doi.org/10.1140/epjb/e2004-00124-y
Nikolaev, A.G., Gore, S., Govindaraju, V.: Engagement capacity and engaging team formation for reach maximization of online social media platforms. In: KDD, pp. 225–234 (2016)
Pedreira, P., Croswhite, C., Bona, L.: Cubrick: indexing millions of records per second for interactive analytics. Proc. VLDB Endow. 9(13), 1305–1316 (2016)
Pellegrina, L., Riondato, M., Vandin, F.: Hypothesis testing and statistically-sound pattern mining (tutorial). In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 4–8 August 2019, pp. 3215–3216 (2019)
Roquain, E.: Type i error rate control for testing many hypotheses: a survey with proofs. Journal de la Société Française de Statistique 152(2), 3–38 (2011)
Srikant, R., Agrawal, R.: Mining generalized association rules. Futur. Gener. Comput. Syst. 13(2–3), 161–180 (1997)
Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
Webb, G.I., Petitjean, F.: A multiple test correction for streams and cascades of statistical hypothesis tests. In: Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Francisco, USA, August 2016, pp. 1255–1264 (2016)
Xin, D., Shen, X., Mei, Q., Han, J.: Discovering interesting patterns through user’s interactive feedback. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 773–778. ACM (2006)
Zgraggen, E., Zhao, Z., Zeleznik, R., Kraska, T.: Investigating the effect of the multiple comparisons problem in visual analysis. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2018)
Zhao, Z., Stefani, L.D., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, 14–19 May 2017, pp. 527–540. ACM (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Bouarour, N., Benouaret, I., Amer-Yahia, S. (2022). Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science(), vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_3
Download citation
DOI: https://doi.org/10.1007/978-3-662-66111-6_3
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-66110-9
Online ISBN: 978-3-662-66111-6
eBook Packages: Computer ScienceComputer Science (R0)