Tutorial: Supervised Learning for Prevalence Estimation

  • Conference paper
Flexible Query Answering Systems (FQAS 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11529)

Abstract

Quantification is the task of estimating, given a set \(\sigma \) of unlabelled items and a set of classes \(\mathcal {C}\), the relative frequency (or “prevalence”) \(p(c_{i})\) of each class \(c_{i}\in \mathcal {C}\). Quantification is important in many disciplines (e.g., market research, political science, the social sciences, and epidemiology) that usually deal with aggregate (as opposed to individual) data. In these contexts, classifying individual unlabelled instances is usually not a primary goal, whereas estimating the prevalence of the classes of interest in the data is. Quantification may in principle be solved via classification, i.e., by classifying each item in \(\sigma \) and counting, for each \(c_{i}\in \mathcal {C}\), how many items have been labelled with \(c_{i}\). However, a multitude of works have shown that this “classify and count” (CC) method yields suboptimal quantification accuracy, one reason being that most classifiers are optimized for classification accuracy rather than for quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification; it has evolved into a task of its own, devoted to designing methods and algorithms that deliver better prevalence estimates than CC. The goal of this tutorial is to introduce the main supervised learning techniques that have been proposed for solving quantification, the metrics used to evaluate them, and the most promising directions for further research.
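
As an illustration of the “classify and count” (CC) baseline described above, here is a minimal Python sketch. It is not code from the tutorial: the function name `classify_and_count`, the toy threshold classifier, and the sample data are all hypothetical, chosen only to show the counting step. It assumes each item receives exactly one label from \(\mathcal {C}\).

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Sequence


def classify_and_count(
    items: Iterable,
    classes: Sequence[Hashable],
    classify: Callable[[object], Hashable],
) -> Dict[Hashable, float]:
    """Estimate p(c_i) for each class by labelling every item and counting.

    `classify` stands in for any trained classifier (an assumption here);
    it must return exactly one label per item.
    """
    items = list(items)
    if not items:
        raise ValueError("cannot estimate prevalence on an empty set")
    counts = Counter(classify(x) for x in items)
    # Relative frequency of each class among the predicted labels.
    return {c: counts.get(c, 0) / len(items) for c in classes}


# Hypothetical toy classifier: a score of 0.5 or more is labelled "pos".
def toy_classifier(score: float) -> str:
    return "pos" if score >= 0.5 else "neg"


# Hypothetical unlabelled set sigma, represented here by raw scores.
sigma = [0.1, 0.4, 0.6, 0.9, 0.7]
estimates = classify_and_count(sigma, ["pos", "neg"], toy_classifier)
print(estimates)
```

Because the estimate is just a tally of the classifier's predictions, any systematic bias in the classifier (for instance, more false positives than false negatives) carries over directly into the prevalence estimate, which is one way to see why CC can be suboptimal.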

Author information

Corresponding author

Correspondence to Fabrizio Sebastiani.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Moreo, A., Sebastiani, F. (2019). Tutorial: Supervised Learning for Prevalence Estimation. In: Cuzzocrea, A., Greco, S., Larsen, H., Saccà, D., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2019. Lecture Notes in Computer Science, vol 11529. Springer, Cham. https://doi.org/10.1007/978-3-030-27629-4_3

  • DOI: https://doi.org/10.1007/978-3-030-27629-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27628-7

  • Online ISBN: 978-3-030-27629-4

  • eBook Packages: Computer Science, Computer Science (R0)
