Abstract
Researchers often query online social platforms through their application programming interfaces (APIs) to find target populations such as people with mental illness (De Choudhury et al. in Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369, https://doi.org/10.1145/2998181.2998220, 2017) and jazz musicians (Heckathorn and Jeffri in Poetics 28(4):307, 2001). Entities in such a target population satisfy a property that is typically identified using an oracle (a human or a pre-trained classifier). When that property cannot be made part of an API query, we refer to the property as ‘hidden’ and to the population as a hidden population. Our objective in this paper is to sample such hidden populations. Finding individuals who belong to these populations on social networks is hard: API calls are limited, only a limited number of attributes are queryable, and the sampler must choose its queries from a combinatorial, exponentially large query space within a finite budget. To address these challenges, we model the search for hidden populations as a decision problem. By exploiting the correlation between queryable attributes and the population of interest, and by hierarchically ordering the query space, we propose a decision-tree-based Thompson sampler (DT-TMP) that efficiently discovers the right combination of attributes to query. The hierarchical ordering of the queries also alleviates the problem of exploring the exponentially large query space.
Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example, by 54% on Twitter. In offline experiments, where the number of entities matching a query is known, DT-TMP performs better than the baseline samplers by a factor of 0.9–1.5\(\times\). In future work, we wish to explore finding hidden populations by formulating more complex queries.
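To make the core idea concrete, the following is a minimal sketch of plain Beta-Bernoulli Thompson sampling over a fixed list of candidate queries. It is not the DT-TMP algorithm: it omits the decision tree and the hierarchical ordering of the query space, and it simulates the oracle with made-up hit rates. The function name `thompson_sample_queries` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def thompson_sample_queries(hit_rates, budget, seed=0):
    """Flat Thompson sampling over a fixed list of queries (illustrative only).

    hit_rates: assumed probability, per query, that a returned entity belongs
               to the hidden population (unknown to the sampler; used here
               only to simulate the oracle's answer).
    budget:    number of API calls the sampler may spend.
    Returns (total_hits, pulls_per_query).
    """
    rng = random.Random(seed)
    n = len(hit_rates)
    successes = [1] * n  # Beta(1, 1) uniform prior for every query
    failures = [1] * n
    pulls = [0] * n
    total_hits = 0

    for _ in range(budget):
        # Draw one plausible hit rate per query from its current posterior.
        draws = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        # Issue the query whose sampled hit rate is highest.
        q = max(range(n), key=draws.__getitem__)
        # Simulated oracle: does the returned entity have the hidden property?
        hit = rng.random() < hit_rates[q]
        if hit:
            successes[q] += 1
            total_hits += 1
        else:
            failures[q] += 1
        pulls[q] += 1

    return total_hits, pulls


if __name__ == "__main__":
    # Three hypothetical queries; the third correlates best with the target.
    hits, pulls = thompson_sample_queries([0.02, 0.05, 0.7], budget=500)
    print(hits, pulls)
```

In this sketch each query is an independent arm, so the number of arms grows with the number of attribute combinations. The abstract's hierarchical ordering of the query space is aimed at exactly that growth.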
References
Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics. Proceedings of Machine Learning Research 31:99–107
Álvarez M, Raposo J, Pan A, Cacheda F, Bellas F, Carneiro V (2007) International conference on computational science and its applications. In: International conference on computational science and its applications. Springer, Berlin, pp 322–333
Bousquet O, Boucheron S, Lugosi G (2003) Introduction to statistical learning theory. In: Summer school on machine learning. Springer, pp 169–207
Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific Web resource discovery. Comput Netw 31(11):1623
Dasgupta A, Das G, Mannila H (2007) A random walk approach to sampling hidden databases. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data (ACM), pp 629–640
De Choudhury M, Gamon M, Counts S, Horvitz E (2013) Predicting depression via social media. In: Seventh international AAAI conference on weblogs and social media
De Choudhury M, Sharma SS, Logar T, Eekhout W, Nielsen RC (2017) Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369. https://doi.org/10.1145/2998181.2998220
Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. In: International conference on computational learning theory. Springer, pp 255–270
Hall BH, Jaffe AB, Trajtenberg M (2001) The NBER patent citation data file: lessons, insights and methodological tools. Technical report, National Bureau of Economic Research
Heckathorn DD, Jeffri J (2001) Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28(4):307
Hersovici M, Jacovi M, Maarek YS, Pelleg D, Shtalhaim M, Ur S (1998) The shark-search algorithm. An application: tailored Web site mapping. Comput Netw ISDN Syst 30(1):317
IMDb dataset. https://www.imdb.com/interfaces/. Accessed: 11 Dec 2019
Inc G (2017) X-armed bandits. Celebrating nine years of github with an anniversary sale
Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237
Karimi H, Tang J, Li Y (2018) Toward end-to-end deception detection in videos. In: IEEE international conference on big data (IEEE), pp 1278–1283
Kohavi R (1996) Scaling up the accuracy of Naive–Bayes classifiers: a decision-tree hybrid. In: KDD, vol 96. Citeseer, pp 202–207
Komiyama J, Honda J, Nakagawa J (2015) Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. arXiv:1506.00779
Kumar S, Gao H, Wang C, Chang KCC, Sundaram H (2019) Hierarchical multi-armed bandits for discovering hidden populations. In: ASONAM
Larson RR (2010) Introduction to information retrieval. J Am Soc Inform Sci Technol 61(4):852
Liakos P, Ntoulas A, Labrinidis A, Delis A (2016) Focused crawling for the hidden web. World Wide Web 19(4):605
Li C, Resnick P, Mei Q (2016) Multiple queries as bandit arms. In: Proceedings of the 25th ACM international on conference on information and knowledge management (ACM), pp 1089–1098
Malekinejad M, Johnston LG, Kendall C, Kerr LRFS, Rifkin MR, Rutherford GW (2008) Using respondent-driven sampling methodology for HIV biological and behavioral surveillance in international settings: a systematic review. AIDS Behav 12(1):105
Medina AM, Mohri M (2014) PAC bounds for multi-armed bandit and Markov decision processes. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 262–270
Menczer F, Belew RK (2000) Adaptive retrieval agents: internalizing local context and scaling up to the Web. Mach Learn 39(2):203
Mullen L, Blevins C, Schmidt B (2015) Gender: predict gender from names using historical data. R package version 0.5.1
Nazi A, Thirumuruganathan S, Hristidis V, Zhang N, Das G (2015) Querying hidden attributes in an online community network. In: 2015 IEEE 12th international conference on mobile ad hoc and sensor systems (MASS). IEEE, pp 657–662
Olston C, Najork M (2010) Web crawling. Found Trends Inf Retr 4(3):175
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81
Raghavan S, Garcia-Molina H (2000) Crawling the hidden web. Technical report, Stanford
Raisi E, Huang B (2017) Cyberbullying detection with weakly supervised machine learning. In: ASONAM (ACM), pp 409–416
Rieh SY et al (2006) Analysis of multiple query reformulations on the web: the interactive information retrieval context. Inf Process Manag 42(3):751
Robinson D (2017) Introduction to empirical bayes: examples from baseball statistics
Sheng C, Zhang N, Tao Y, Jin X (2012) Optimal algorithms for crawling a hidden database in the web. Proc VLDB Endow 5(11):1112
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285
Tsugawa S, Kikuchi Y, Kishino F, Nakajima K, Itoh Y, Ohsaki H (2015) Recognizing depression from Twitter activity. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems (ACM), pp 3187–3196
Twitter Inc (2018). Annual report. http://bit.ly/2LhgDEc
Wu P, Wen JR, Liu H, Ma WY (2006) Query selection techniques for efficient crawling of structured web sources. In: Proceedings of the 22nd international conference on data engineering (ICDE’06). IEEE, pp 47–47
Zheng Q, Wu Z, Cheng X, Jiang L, Liu J (2013) Learning to crawl deep web. Inf Syst 38(6):801
About this article
Cite this article
Kumar, S., Gao, H., Wang, C. et al. Decision tree Thompson sampling for mining hidden populations through attributed search. Soc. Netw. Anal. Min. 12, 6 (2022). https://doi.org/10.1007/s13278-021-00812-5
DOI: https://doi.org/10.1007/s13278-021-00812-5