Decision tree Thompson sampling for mining hidden populations through attributed search

Kumar, Suhansanu; Gao, Heting; Wang, Changyu; Chang, Kevin Chen-Chuan; Sundaram, Hari

doi:10.1007/s13278-021-00812-5

Original Article
Published: 15 November 2021

Volume 12, article number 6 (2022)

Social Network Analysis and Mining

Suhansanu Kumar ORCID: orcid.org/0000-0003-3637-9350¹,
Heting Gao¹,
Changyu Wang¹,
Kevin Chen-Chuan Chang¹ &
Hari Sundaram¹

Abstract

Researchers often query online social platforms through their application programming interfaces (APIs) to find target populations such as people with mental illness (De Choudhury et al. in Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369, https://doi.org/10.1145/2998181.2998220, 2017) and jazz musicians (Heckathorn and Jeffri in Poetics 28(4):307, 2001). Entities in such a target population satisfy a property that is typically identified using an oracle (a human or a pre-trained classifier). When that property cannot be queried directly via the API, we refer to the property as ‘hidden’ and to the population as a hidden population. Our objective in this paper is to sample such hidden populations: the oracle can verify the property for any retrieved entity, but the property itself cannot be part of the query. Finding individuals who belong to these populations on social networks is hard because the sampler has to explore a combinatorial query space within a finite budget. Limited API calls, a limited number of queryable attributes, and an exponentially large space of candidate queries make the problem very challenging. To address these challenges, we model the search for hidden populations as a decision problem. By exploiting the correlation between queryable attributes and the population of interest, and by hierarchically ordering the query space, we propose a decision-tree-based Thompson sampler (DT-TMP) that efficiently discovers the right combination of attributes to query. The hierarchical ordering also alleviates the problem of exploring the exponentially large query space.
Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example, by 54% on Twitter. In offline experiments, where the number of entities matching a query is known, DT-TMP improves on the baseline samplers by a factor of 0.9–1.5\(\times\). In future work, we wish to explore finding hidden populations by formulating more complex queries.
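
The general idea of Thompson sampling over a hierarchically ordered query space can be illustrated with a minimal sketch. This is not the authors' DT-TMP implementation, which the abstract does not specify. The tree shape, the attribute names, the per-query hit rates, and the simulated oracle below are all hypothetical, and the sketch assumes a simple Beta-Bernoulli model in which each query returns one entity that the oracle either accepts or rejects.

```python
import random


class QueryNode:
    """A node in a hypothetical query tree: one conjunction of attribute values."""

    def __init__(self, name, children=None):
        self.name = name                # e.g. "location=NYC AND lang=en" (illustrative)
        self.children = children or []
        self.successes = 0              # retrieved entities the oracle accepted
        self.failures = 0               # retrieved entities the oracle rejected

    def posterior_draw(self, rng):
        # Sample a plausible hit rate from the Beta(1 + s, 1 + f) posterior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)


def select_path(root, rng):
    """Descend the tree, following the child with the largest posterior draw."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.posterior_draw(rng))
        path.append(node)
    return path


def run_sampler(root, true_hit_rate, budget, seed=0):
    """Spend `budget` simulated API calls; return the number of target entities found.

    `true_hit_rate` maps each leaf name to an assumed probability that the
    oracle accepts an entity returned by that query (unknown to the sampler).
    """
    rng = random.Random(seed)
    found = 0
    for _ in range(budget):
        path = select_path(root, rng)
        leaf = path[-1]
        hit = rng.random() < true_hit_rate[leaf.name]   # simulated oracle verdict
        for node in path:                               # credit every query on the path
            if hit:
                node.successes += 1
            else:
                node.failures += 1
        found += int(hit)
    return found


def build_demo_tree():
    """Two first-level attribute values, each refined by a second attribute."""
    a = QueryNode("A", [QueryNode("A1"), QueryNode("A2")])
    b = QueryNode("B", [QueryNode("B1"), QueryNode("B2")])
    return QueryNode("root", [a, b])


if __name__ == "__main__":
    tree = build_demo_tree()
    rates = {"A1": 0.8, "A2": 0.1, "B1": 0.05, "B2": 0.1}
    print("targets found:", run_sampler(tree, rates, budget=2000))
```

Because statistics are shared along the path from the root to the chosen leaf, evidence about a refined query also informs the coarser query above it, so the sampler can discard an unpromising subtree without trying every query inside it. That is one plausible reading of how a hierarchical ordering reduces exploration of an exponentially large query space.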

References

Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics. Proceedings of Machine Learning Research 31:99–107
Álvarez M, Raposo J, Pan A, Cacheda F, Bellas F, Carneiro V (2007) International conference on computational science and its applications. In: International conference on computational science and its applications. Springer, Berlin, pp 322–333
Bousquet O, Boucheron S, Lugosi G (2003) Introduction to statistical learning theory. In: Summer school on machine learning. Springer, pp 169–207
Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific Web resource discovery. Comput Netw 31(11):1623
Dasgupta A, Das G, Mannila H (2007) A random walk approach to sampling hidden databases. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data (ACM), pp 629–640
De Choudhury M, Gamon M, Counts S, Horvitz E (2013) Predicting depression via social media. In: Seventh international AAAI conference on weblogs and social media
De Choudhury M, Sharma SS, Logar T, Eekhout W, Nielsen RC (2017) Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369. https://doi.org/10.1145/2998181.2998220
Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. In: International conference on computational learning theory. Springer, pp 255–270
Hall BH, Jaffe AB, Trajtenberg M (2001) The NBER patent citation data file: lessons, insights and methodological tools. Technical report, National Bureau of Economic Research
Heckathorn DD, Jeffri J (2001) Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28(4):307
Hersovici M, Jacovi M, Maarek YS, Pelleg D, Shtalhaim M, Ur S (1998) The shark-search algorithm. An application: tailored Web site mapping. Comput Netw ISDN Syst 30(1):317
Imdb dataset. https://www.imdb.com/interfaces/. Accessed: 11 Dec 2019
Inc G (2017) X-armed bandits. Celebrating nine years of github with an anniversary sale
Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237
Karimi H, Tang J, Li Y (2018) Toward end-to-end deception detection in videos. In: IEEE international conference on big data (IEEE), pp 1278–1283
Kohavi R (1996) Scaling up the accuracy of Naive–Bayes classifiers: a decision-tree hybrid. Kdd 96 (Citeseer), 96:202–207
Komiyama J, Honda J, Nakagawa J (2015) Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. arXiv:1506.00779
Kumar S, Gao H, Wang C, Chang KCC, Sundaram H (2019) Hierarchical multi-armed bandits for discovering hidden populations. In: ASONAM
Larson RR (2010) Introduction to information retrieval. J Am Soc Inform Sci Technol 61(4):852
Liakos P, Ntoulas A, Labrinidis A, Delis A (2016) Focused crawling for the hidden web. World Wide Web 19(4):605
Li C, Resnick P, Mei Q (2016) Multiple queries as bandit arms. In: Proceedings of the 25th ACM international on conference on information and knowledge management (ACM), pp 1089–1098
Malekinejad M, Johnston LG, Kendall C, Kerr LRFS, Rifkin MR, Rutherford GW (2008) Using respondent-driven sampling methodology for HIV biological and behavioral surveillance in international settings: a systematic review. AIDS Behav 12(1):105
Medina AM, Mohri M (2014) PAC bounds for multi-armed bandit and Markov decision processes. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 262–270
Menczer F, Belew RK (2000) Adaptive retrieval agents: internalizing local context and scaling up to the Web. Mach Learn 39(2):203
Mullen L, Blevins C, Schmidt B (2015) Gender: predict gender from names using historical data. R package version 0.5 1
Nazi A, Thirumuruganathan S, Hristidis V, Zhang N, Das G (2015) Querying hidden attributes in an online community network. In: 2015 IEEE 12th international conference on mobile ad hoc and sensor systems (MASS). IEEE, pp 657–662
Olston C, Najork M (2010) Web crawling. Foundations and Trends® in Information Retrieval 4(3):175
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81
Raghavan S, Garcia-Molina H (2000) Crawling the hidden web. Technical report, Stanford
Raisi E, Huang B (2017) Cyberbullying detection with weakly supervised machine learning. In: ASONAM (ACM), pp 409–416
Rieh SY et al (2006) Analysis of multiple query reformulations on the web: the interactive information retrieval context. Inf Process Manag 42(3):751
Robinson D (2017) Introduction to empirical bayes: examples from baseball statistics
Sheng C, Zhang N, Tao Y, Jin X (2012) Optimal algorithms for crawling a hidden database in the web. Proc VLDB Endow 5(11):1112
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285
Tsugawa S, Kikuchi Y, Kishino F, Nakajima K, Itoh Y, Ohsaki H (2015) Recognizing depression from Twitter activity. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems (ACM), pp 3187–3196
Twitter Inc (2018). Annual report. http://bit.ly/2LhgDEc
Wu P, Wen JR, Liu H, Ma WY (2006) Query selection techniques for efficient crawling of structured web sources. In: Data engineering, 2006. ICDE’06. Proceedings of the 22nd international conference on (IEEE), pp 47–47
Zheng Q, Wu Z, Cheng X, Jiang L, Liu J (2013) Learning to crawl deep web. Inf Syst 38(6):801

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
Suhansanu Kumar, Heting Gao, Changyu Wang, Kevin Chen-Chuan Chang & Hari Sundaram

Corresponding author

Correspondence to Suhansanu Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kumar, S., Gao, H., Wang, C. et al. Decision tree Thompson sampling for mining hidden populations through attributed search. Soc. Netw. Anal. Min. 12, 6 (2022). https://doi.org/10.1007/s13278-021-00812-5

Received: 14 January 2020
Revised: 12 June 2021
Accepted: 11 July 2021
Published: 15 November 2021
DOI: https://doi.org/10.1007/s13278-021-00812-5
