Significance and Coverage in Group Testing on the Social Web

Authors:

Nassim Bouarour,

Idir Benouaret,

Sihem Amer-Yahia

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 3052 - 3060

https://doi.org/10.1145/3485447.3512025

Published: 25 April 2022

Abstract

We tackle the longstanding question of checking hypotheses on the social Web. In particular, we address the challenges that arise when testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of not being representative of the input data. We develop GroupTest, a framework for group testing that addresses both challenges. We formulate CoverTest, a generic top-n problem that seeks n user groups that satisfy one-sample, two-sample, or multiple-sample tests and maximize data coverage. We show the hardness of CoverTest and develop a greedy algorithm with a provable approximation guarantee, as well as a faster heuristic algorithm based on α-investing. Our extensive experiments on four real-world datasets demonstrate the need to optimize coverage for sound data-driven discoveries, and the efficiency of our heuristic algorithm.
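To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the GroupTest implementation. It assumes each candidate user group already has a p-value for the input hypothesis, filters the groups with a textbook form of α-investing, and then picks n of the surviving groups with the classic greedy rule for maximum coverage. The function names, the spending policy (half of the affordable level at each step), the initial wealth, and the payout are all illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def alpha_investing(p_values: List[float], wealth: float = 0.05,
                    payout: float = 0.05) -> List[int]:
    """Return indices of hypotheses rejected by a simple alpha-investing run.

    Assumed policy: at each step spend half of the largest level the
    current wealth can afford, so the wealth always stays positive.
    """
    rejected = []
    for j, p in enumerate(p_values):
        # Largest affordable level is wealth / (1 + wealth); spend half of it.
        alpha_j = 0.5 * wealth / (1.0 + wealth)
        if p <= alpha_j:
            rejected.append(j)
            wealth += payout                      # a discovery earns wealth
        else:
            wealth -= alpha_j / (1.0 - alpha_j)   # a failed test costs wealth
    return rejected


def greedy_cover(groups: Dict[str, Set[int]], n: int) -> Tuple[List[str], Set[int]]:
    """Pick up to n groups, each time taking the one that adds most new users."""
    remaining = dict(groups)
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(n):
        if not remaining:
            break
        best = max(remaining, key=lambda g: len(remaining[g] - covered))
        if not remaining[best] - covered:
            break                                 # no group adds new users
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical groups (sets of user ids) and their p-values, in test order.
groups = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
p_values = {"a": 0.001, "b": 0.01, "c": 0.5, "d": 0.02}

names = list(groups)
kept = [names[i] for i in alpha_investing([p_values[g] for g in names])]
top, covered = greedy_cover({g: groups[g] for g in kept}, n=2)
print(kept, top, covered)
```

In this toy run, group "c" is filtered out as not significant, and group "d" is significant but is skipped by the coverage step because its users are already covered by "a". This is the kind of redundancy that a coverage objective is meant to avoid.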

Index Terms

Significance and Coverage in Group Testing on the Social Web
1. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN:9781450390965

DOI:10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
