Decision tree Thompson sampling for mining hidden populations through attributed search

  • Original Article
Social Network Analysis and Mining

Abstract

Researchers often query online social platforms through their application programming interfaces (APIs) to find target populations such as people with mental illness (De Choudhury et al. in Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369, https://doi.org/10.1145/2998181.2998220, 2017) and jazz musicians (Heckathorn and Jeffri in Poetics 28(4):307, 2001). Entities in such a target population satisfy a property that is typically verified by an oracle (a human or a pre-trained classifier). When the property cannot be part of an API query, we call the property ‘hidden’ and the population a hidden population. Our objective in this paper is to sample such hidden populations. Finding individuals who belong to these populations on social networks is hard: the sampler has a finite budget of API calls and a limited number of queryable attributes, yet it must select queries from an exponentially large, combinatorial query space. To address these challenges, we model the search for a hidden population as a decision problem. By exploiting the correlation between queryable attributes and the population of interest, and by ordering the query space hierarchically, we propose a decision-tree-based Thompson sampler (DT-TMP) that efficiently discovers the right combination of attributes to query. The hierarchical ordering of queries also alleviates the problem of exploring the exponentially large query space.
Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example, by 54% on Twitter. In offline experiments, where the number of entities matching a query is known, DT-TMP improves on the baseline samplers by a factor of 0.9–1.5\(\times\). In the future, we wish to explore finding hidden populations by formulating more complex queries.
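
To make the sampling idea concrete, the following is a minimal sketch of plain (flat) Beta-Bernoulli Thompson sampling over a handful of candidate queries. It is not the DT-TMP algorithm from the paper: it has no decision tree or hierarchical ordering, the oracle is simulated by a coin flip, and the function name, query tuples and hit rates are all hypothetical values chosen for illustration.

```python
import random


def thompson_sample_queries(hit_rates, budget, seed=0):
    """Flat Beta-Bernoulli Thompson sampling over candidate queries.

    hit_rates: hypothetical dict mapping a query (here a tuple of
    attribute-value strings) to the probability that an entity returned
    by that query passes the oracle. In a real setting this is unknown.
    Returns (per-query [hits, misses] counts, total hits found).
    """
    rng = random.Random(seed)
    # Start every query with a Beta(1, 1) prior: zero hits, zero misses.
    stats = {q: [0, 0] for q in hit_rates}
    found = 0
    for _ in range(budget):
        # Draw one plausible hit rate per query from its Beta posterior.
        draws = {q: rng.betavariate(1 + h, 1 + m) for q, (h, m) in stats.items()}
        # Issue the query whose sampled hit rate is largest.
        query = max(draws, key=draws.get)
        # Simulated oracle check (stands in for an API call plus a classifier).
        hit = rng.random() < hit_rates[query]
        stats[query][0 if hit else 1] += 1
        found += hit
    return stats, found


# Hypothetical queries and hit rates, for illustration only.
rates = {
    ("lang=en", "loc=NYC"): 0.05,
    ("lang=en", "bio=jazz"): 0.60,
    ("lang=fr",): 0.10,
}
stats, found = thompson_sample_queries(rates, budget=500)
pulls = {q: h + m for q, (h, m) in stats.items()}
print("most issued query:", max(pulls, key=pulls.get), "| hits found:", found)
```

Each unit of budget goes to the query whose posterior draw is highest, so queries that keep returning target entities are issued more often while uncertain ones still get occasional trials. A flat sampler like this needs one arm per query, which is the scaling problem the abstract's hierarchical ordering is meant to address.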

Notes

  1. https://www.kaggle.com/orgesleka/used-cars-database.

  2. https://www.usatoday.com/sports/nfl/rankings/.

  3. https://developer.twitter.com/en/docs.

  4. http://githut.info/.

References

  • Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics. Proceedings of Machine Learning Research 31:99–107

  • Álvarez M, Raposo J, Pan A, Cacheda F, Bellas F, Carneiro V (2007) International conference on computational science and its applications. In: International conference on computational science and its applications. Springer, Berlin, pp 322–333

  • Bousquet O, Boucheron S, Lugosi G (2003) Introduction to statistical learning theory. In: Summer school on machine learning. Springer, pp 169–207

  • Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific Web resource discovery. Comput Netw 31(11):1623

  • Dasgupta A, Das G, Mannila H (2007) A random walk approach to sampling hidden databases. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data (ACM), pp 629–640

  • De Choudhury M, Gamon M, Counts S, Horvitz E (2013) Predicting depression via social media. In: Seventh international AAAI conference on weblogs and social media

  • De Choudhury M, Sharma SS, Logar T, Eekhout W, Nielsen RC (2017) Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369. https://doi.org/10.1145/2998181.2998220

  • Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. In: International conference on computational learning theory. Springer, pp 255–270

  • Hall BH, Jaffe AB, Trajtenberg M (2001) The NBER patent citation data file: lessons, insights and methodological tools. Technical report, National Bureau of Economic Research

  • Heckathorn DD, Jeffri J (2001) Finding the beat: using respondent-driven sampling to study jazz musicians. Poetics 28(4):307

  • Hersovici M, Jacovi M, Maarek YS, Pelleg D, Shtalhaim M, Ur S (1998) The shark-search algorithm. An application: tailored Web site mapping. Comput Netw ISDN Syst 30(1):317

  • Imdb dataset. https://www.imdb.com/interfaces/. Accessed: 11 Dec 2019

  • Inc G (2017) X-armed bandits. Celebrating nine years of github with an anniversary sale

  • Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237

  • Karimi H, Tang J, Li Y (2018) Toward end-to-end deception detection in videos. In: IEEE international conference on big data (IEEE), pp 1278–1283

  • Kohavi R (1996) Scaling up the accuracy of Naive–Bayes classifiers: a decision-tree hybrid. In: KDD, vol 96. Citeseer, pp 202–207

  • Komiyama J, Honda J, Nakagawa J (2015) Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. arXiv:1506.00779

  • Kumar S, Gao H, Wang C, Chang KCC, Sundaram H (2019) Hierarchical multi-armed bandits for discovering hidden populations. In: ASONAM

  • Larson RR (2010) Introduction to information retrieval. J Am Soc Inform Sci Technol 61(4):852

  • Liakos P, Ntoulas A, Labrinidis A, Delis A (2016) Focused crawling for the hidden web. World Wide Web 19(4):605

  • Li C, Resnick P, Mei Q (2016) Multiple queries as bandit arms. In: Proceedings of the 25th ACM international on conference on information and knowledge management (ACM), pp 1089–1098

  • Malekinejad M, Johnston LG, Kendall C, Kerr LRFS, Rifkin MR, Rutherford GW (2008) Using respondent-driven sampling methodology for HIV biological and behavioral surveillance in international settings: a systematic review. AIDS Behav 12(1):105

  • Medina AM, Mohri M (2014) PAC bounds for multi-armed bandit and Markov decision processes. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 262–270

  • Menczer F, Belew RK (2000) Adaptive retrieval agents: internalizing local context and scaling up to the Web. Mach Learn 39(2):203

  • Mullen L, Blevins C, Schmidt B (2015) Gender: predict gender from names using historical data. R package version 0.5 1

  • Nazi A, Thirumuruganathan S, Hristidis V, Zhang N, Das G (2015) Querying hidden attributes in an online community network. In: 2015 IEEE 12th international conference on mobile ad hoc and sensor systems (MASS). IEEE, pp 657–662

  • Olston C, Najork M (2010) Web crawling. Foundations and Trends® in Information Retrieval 4(3):175

  • Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81

  • Raghavan S, Garcia-Molina H (2000) Crawling the hidden web. Technical report, Stanford

  • Raisi E, Huang B (2017) Cyberbullying detection with weakly supervised machine learning. In: ASONAM (ACM), pp 409–416

  • Rieh SY et al (2006) Analysis of multiple query reformulations on the web: the interactive information retrieval context. Inf Process Manag 42(3):751

  • Robinson D (2017) Introduction to empirical bayes: examples from baseball statistics

  • Sheng C, Zhang N, Tao Y, Jin X (2012) Optimal algorithms for crawling a hidden database in the web. Proc VLDB Endow 5(11):1112

  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285

  • Tsugawa S, Kikuchi Y, Kishino F, Nakajima K, Itoh Y, Ohsaki H (2015) Recognizing depression from Twitter activity. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems (ACM), pp 3187–3196

  • Twitter Inc (2018). Annual report. http://bit.ly/2LhgDEc

  • Wu P, Wen JR, Liu H, Ma WY (2006) Query selection techniques for efficient crawling of structured web sources. In: Proceedings of the 22nd international conference on data engineering (ICDE’06). IEEE, pp 47–47

  • Zheng Q, Wu Z, Cheng X, Jiang L, Liu J (2013) Learning to crawl deep web. Inf Syst 38(6):801

Author information

Corresponding author

Correspondence to Suhansanu Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kumar, S., Gao, H., Wang, C. et al. Decision tree Thompson sampling for mining hidden populations through attributed search. Soc. Netw. Anal. Min. 12, 6 (2022). https://doi.org/10.1007/s13278-021-00812-5

  • DOI: https://doi.org/10.1007/s13278-021-00812-5
