Effective social post classifiers on top of search interfaces

Data Mining and Knowledge Discovery, volume 35, pages 1809–1829 (2021)

Abstract

Applying text classification to find social media posts relevant to a topic of interest is the focus of a substantial amount of research. A key challenge is how to select a good training set of posts to label. This problem has traditionally been solved using active learning. However, active learning assumes access to all posts in the collection, which is unrealistic in many cases because social networks constrain the number of posts that can be retrieved through their search APIs. To address this problem, which we refer to as the training post retrieval over constrained search interfaces problem, we propose several keyword selection algorithms that, given a topic, generate an effective set of keyword queries to submit to the search API. The returned posts are labeled and used as a training dataset for post classifiers. Our experiments compare the proposed keyword selection algorithms to several baselines across various topics from three sources. The results show that the proposed methods generate superior training sets, as measured by the balanced accuracy of the trained classifiers.
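
As a rough illustration of the setting described in the abstract, the sketch below simulates a search interface that returns only a limited number of posts per keyword query, and computes balanced accuracy as the mean of per-class recall. It is not the authors' implementation: the names `constrained_search`, `collect_training_posts`, and `balanced_accuracy`, the toy corpus, and the per-query limit are all assumptions made for illustration.

```python
from collections import Counter


def constrained_search(corpus, keyword, limit):
    """Hypothetical search API: at most `limit` posts containing `keyword`."""
    matches = [post for post in corpus if keyword in post.lower().split()]
    return matches[:limit]


def collect_training_posts(corpus, keywords, limit):
    """Submit one query per keyword and merge the results without duplicates."""
    collected = []
    for keyword in keywords:
        for post in constrained_search(corpus, keyword, limit):
            if post not in collected:
                collected.append(post)
    return collected


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy corpus (assumed data, for illustration only).
corpus = [
    "my knee pain is worse today",
    "great pizza downtown",
    "pain after surgery",
    "new phone arrived",
    "chronic pain and insomnia",
]

# Two keyword queries, each returning at most two posts.
posts = collect_training_posts(corpus, ["pain", "insomnia"], limit=2)
print(len(posts))  # 3

# A classifier that always predicts "not relevant" (0) on imbalanced labels.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The second example shows why balanced accuracy suits imbalanced topics: the trivial classifier is right on 8 of 10 posts, yet its balanced accuracy is only 0.5 because it recalls none of the relevant posts.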

Data Availability Statement

The datasets used in our experiments were collected from DailyStrength, Reddit, and Misra (2018).

Notes

  1. This value was used by the experiments in Wang et al. (2016).

References

  • Ahmad S, Asghar MZ, Alotaibi FM, Awan I (2019) Detection and classification of social media-based extremist affiliations using sentiment analysis techniques. Hum Cent Comput Inf Sci 9:24

  • Balsamo D, Bajardi P, Panisson A (2019) Firsthand opiates abuse on social media: monitoring geospatial patterns of interest through a digital cohort. Proc WWW 2019:2572–2579

  • Bissoyi S, Mishra BK, Patra MR (2016) Recommender systems in a patient centric social network—a survey. Proc SCOPES 2016:386–389

  • Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

  • Brodersen KH, Ong CS, Stephan KE, Buhmann JM (2010) The balanced accuracy and its posterior distribution. Proc ICPR 2010:3121–3124

  • Croft WB, Metzler D, Strohman T (2010) Search engines: information retrieval in practice. Addison-Wesley, Boston

  • de Lira VM, Macdonald C, Ounis I, Perego R, Renso C, Times VC (2019) Event attendance classification in social media. Inform Process Manag 56(3):687–703

  • Elkan C, Noto K (2008) Learning classifiers from only positive and unlabeled data. Proc SIGKDD 2008:213–220

  • Goudjil M, Koudil M, Bedda M, Ghoggali N (2018) A novel active learning method using SVM for text classification. Int J Automat Comput 15(3):290–298

  • Kim Y (2014) Convolutional neural networks for sentence classification. Proc EMNLP 2014:1746–1751

  • Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  • Li C, Xing J, Sun A, Ma Z (2016) Effective document labeling with very few seed words: a topic model approach. Proc CIKM 2016:85–94

  • Li C, Zhou W, Ji F, Duan Y, Chen H (2018) A deep relevance model for zero-shot document filtering. In: Proc 56th annu meeting ACL, pp 2300–2310

  • Li H, Liu B, Mukherjee A, Shao J (2014) Spotting fake reviews using positive-unlabeled learning. Comput Sist 18(3):467–475

  • Li R, Wang S, Cheng KC (2013) Towards social data platform: automatic topic-focused monitor for Twitter stream. Proc VLDB Endow 6(14):1966–1977

  • Li X, Liu B (2003) Learning to classify texts using positive and unlabeled data. Proc IJCAI 2003:587–592

  • Misra R (2018) News category dataset. ResearchGate. https://doi.org/10.13140/RG.2.2.20331.18729

  • Pearson K (1895) Note on regression and inheritance in the case of two parents. Proc R Soc Lond 58(347–352):240–242

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  • Pohl D, Bouchachia A, Hellwagner H (2018) Batch-based active learning: application to social media data for crisis management. Expert Syst Appl 93:232–244

  • Proskurnia J, Mavlyutov R, Castillo C, Aberer K, Cudre-Mauroux P (2017) Efficient document filtering using vector space topic expansion and pattern-mining: the case of event detection in microposts. Proc CIKM 2017:457–466

  • Rao J, Yang W, Zhang Y, Ture F, Lin J (2019) Multi-perspective relevance matching with hierarchical convnets for social media search. In: Proc 33rd AAAI conf artif intell, pp 232–240

  • Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proc LREC 2010 workshop new challenges NLP frameworks, pp 45–50

  • Rivas R, Sadah SA, Guo Y, Hristidis V (2020) Classification of health-related social media posts: evaluation of post content classifier models and analysis of user demographics. JMIR Pub Health Surv 6(2):e14952

  • Ruiz E, Hristidis V, Ipeirotis PG (2014) Efficient filtering on hidden document streams. In: Proc ICWSM

  • Sadri M, Mehrotra S, Yu Y (2016) Online adaptive topic focused tweet acquisition. Proc CIKM 2016:2353–2358

  • Shen S, Murzintcev N, Song C, Cheng C (2017) Information retrieval of a disaster event from cross-platform social media. Inf Discov Deliv 45(4):220–226

  • Smailovic J, Grcar M, Lavrac N, Znidarsic M (2014) Stream-based active learning for sentiment analysis in the financial domain. Inf Sci 285:181–203

  • Thorndike RL (1953) Who belongs in the family? Psychometrika 18(4):267–276

  • Wang S, Chen Z, Liu B, Emery S (2016) Identifying search keywords for finding relevant social media posts. In: Proc 30th AAAI conf artif intell, pp 3052–3058

  • Zhang Y, Lease M, Wallace BC (2017) Active discriminative text representation learning. In: Proc 31st AAAI conf artif intell, pp 3386–3392

  • Zhang Y, Zhao P, Cao J, Ma W, Huang J, Wu Q, Tan M (2018) Online adaptive asymmetric active learning for budgeted imbalanced data. Proc SIGKDD 2018:2768–2777

Funding

This work was supported by NSF Grants IIS-1838222 and IIS-1901379.

Author information

Corresponding author

Correspondence to Ryan Rivas.

Ethics declarations

Code availability

The code used in our experiments is available from: https://github.com/rriva002/Training-Post-Retrieval.

Additional information

Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 416 KB)

About this article

Cite this article

Rivas, R., Hristidis, V. Effective social post classifiers on top of search interfaces. Data Min Knowl Disc 35, 1809–1829 (2021). https://doi.org/10.1007/s10618-021-00768-2

  • DOI: https://doi.org/10.1007/s10618-021-00768-2
