DOI: 10.1145/3485447.3512025
research-article

Significance and Coverage in Group Testing on the Social Web

Published: 25 April 2022

Abstract

We tackle the longstanding question of checking hypotheses on the social Web. In particular, we address the challenges that arise when testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of returning results that are not representative of the input data. We develop GroupTest, a framework for group testing that addresses both challenges. We formulate CoverTest, a generic top-n problem that seeks n user groups that satisfy one-sample, two-sample, or multiple-sample tests and maximize data coverage. We show the hardness of CoverTest and develop a greedy algorithm with a provable approximation guarantee, as well as a faster heuristic algorithm based on α-investing. Our extensive experiments on four real-world datasets demonstrate the need to optimize coverage for sound data-driven discoveries, and the efficiency of our heuristic algorithm.
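The two ingredients named in the abstract, α-investing and greedy coverage maximization, can be illustrated with a minimal Python sketch. This is not the paper's GroupTest or CoverTest implementation: the function names, the default wealth, payout, and spending fraction, and the toy groups are all hypothetical. The first function follows the general α-investing wealth update (earn a payout on each rejection, pay α/(1-α) otherwise), and the second is the textbook greedy rule for maximum coverage.

```python
from typing import Dict, List, Set, Tuple


def alpha_investing(p_values: List[float], wealth: float = 0.05,
                    payout: float = 0.05, spend: float = 0.5) -> List[int]:
    """Return indices of hypotheses rejected by a simple alpha-investing rule.

    Assumed policy (hypothetical): each test bets a fixed fraction of the
    current alpha-wealth.
    """
    rejected = []
    for i, p in enumerate(p_values):
        if wealth <= 0:
            break  # no alpha-wealth left to invest
        cost = wealth * spend            # amount lost if the test fails
        alpha_i = cost / (1.0 + cost)    # chosen so alpha_i / (1 - alpha_i) == cost
        if p <= alpha_i:
            rejected.append(i)
            wealth += payout             # a rejection earns the payout
        else:
            wealth -= cost               # a failed test pays alpha_i / (1 - alpha_i)
    return rejected


def greedy_cover(groups: Dict[str, Set[int]], n: int) -> Tuple[List[str], Set[int]]:
    """Pick up to n groups, each time taking the largest marginal coverage gain."""
    chosen: List[str] = []
    covered: Set[int] = set()
    remaining = dict(groups)
    for _ in range(n):
        best = max(remaining, key=lambda g: len(remaining[g] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new users
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


if __name__ == "__main__":
    # Toy data: one p-value per user group, and the user ids in each group.
    names = ["a", "b", "c", "d"]
    p_values = [0.001, 0.5, 0.2, 0.0001]
    members = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

    significant = [names[i] for i in alpha_investing(p_values)]
    picked, covered = greedy_cover({g: members[g] for g in significant}, n=2)
    print(significant, picked, sorted(covered))
```

Testing first and covering second is a simplification made here for readability; the abstract describes CoverTest as a single top-n problem that handles significance and coverage together.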

    Published In

    WWW '22: Proceedings of the ACM Web Conference 2022
    April 2022
    3764 pages
    ISBN:9781450390965
    DOI:10.1145/3485447
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. coverage
    2. exploratory data analysis
    3. hypothesis testing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '22
    WWW '22: The ACM Web Conference 2022
    April 25 - 29, 2022
    Virtual Event, Lyon, France

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
