Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups

Bouarour, Nassim; Benouaret, Idir; Amer-Yahia, Sihem

doi:10.1007/978-3-662-66111-6_3

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 13410)

Abstract

We tackle the question of checking hypotheses on user data. In particular, we address the challenges that arise when testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of returning results that are not representative of the input (data coverage). We develop GroupTest, a framework for group testing that addresses both challenges. We formulate ValMin and CovMax, two generic top-n problems that seek n user groups satisfying one-sample, two-sample, or multiple-sample tests. ValMin optimizes significance subject to a constraint on data coverage, while CovMax maximizes data coverage while controlling significance. We show the hardness of ValMin and CovMax. We develop a greedy algorithm to solve the former problem and two algorithms to solve the latter: the first is a greedy algorithm with a provable approximation guarantee, and the second is a heuristic algorithm based on \(\alpha\)-investing. Our extensive experiments on real-world datasets demonstrate the need to optimize coverage for sound discoveries on large datasets, and the efficiency of our algorithms.
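To make the coverage-versus-significance trade-off concrete, here is a minimal Python sketch of a greedy top-n selection in the spirit of CovMax. It is not the authors' algorithm: the function name greedy_cover, the Bonferroni threshold used to control significance, and the definition of coverage as the fraction of all users that fall in at least one selected group are all assumptions made for illustration. The selection step is the classical greedy heuristic for maximum coverage, which is known to achieve a (1 - 1/e) approximation for that problem.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def greedy_cover(
    groups: Dict[str, FrozenSet[int]],
    p_values: Dict[str, float],
    n: int,
    alpha: float = 0.05,
) -> Tuple[List[str], float]:
    """Pick up to n significant groups, greedily maximizing user coverage.

    Hypothetical helper: names and the Bonferroni rule are assumptions,
    not taken from the GroupTest paper.
    """
    # Bonferroni correction: each of the m tests is compared to alpha / m.
    m = len(groups)
    threshold = alpha / m if m else 0.0
    candidates = {
        g: users for g, users in groups.items() if p_values[g] <= threshold
    }

    all_users: Set[int] = set().union(*groups.values()) if groups else set()
    covered: Set[int] = set()
    chosen: List[str] = []

    for _ in range(n):
        # Marginal gain = number of not-yet-covered users a group would add.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(candidates),
            key=lambda g: len(candidates[g] - covered),
            default=None,
        )
        if best is None or not (candidates[best] - covered):
            break  # no remaining group adds any new user
        chosen.append(best)
        covered |= candidates.pop(best)

    coverage = len(covered) / len(all_users) if all_users else 0.0
    return chosen, coverage


# Toy input: four user groups with made-up p-values.
toy_groups = {
    "a": frozenset({1, 2, 3}),
    "b": frozenset({3, 4}),
    "c": frozenset({5, 6, 7, 8}),
    "d": frozenset({1, 2}),
}
toy_p = {"a": 0.001, "b": 0.002, "c": 0.5, "d": 0.001}
chosen, coverage = greedy_cover(toy_groups, toy_p, n=2)
print(chosen, coverage)
```

On this toy input, group "c" covers the most users but is filtered out because its p-value fails the corrected threshold, so the selected groups cover only half of the users. This is the tension the abstract describes: controlling false discoveries can come at the cost of representativeness.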

Author information

Authors and Affiliations

CNRS, Univ. Grenoble Alpes, Grenoble, France
Nassim Bouarour, Idir Benouaret & Sihem Amer-Yahia

Corresponding author

Correspondence to Nassim Bouarour.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
IFS, Technical University of Vienna, Vienna, Austria
A Min Tjoa
University of Montpellier, Montpellier, France
Esther Pacitti
University of Rennes 1, Rennes, France
Zoltan Miklos

About this chapter

Cite this chapter

Bouarour, N., Benouaret, I., Amer-Yahia, S. (2022). Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science, vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_3

DOI: https://doi.org/10.1007/978-3-662-66111-6_3
Published: 08 October 2022
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-66110-9
Online ISBN: 978-3-662-66111-6
eBook Packages: Computer Science, Computer Science (R0)
