Threshold Selection for Classification with Skewed Class Distribution

  • Conference paper
Web-Age Information Management (WAIM 2013)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7901))

Abstract

In document classification, threshold selection receives little attention, particularly in binary classification, where it has largely been dismissed as a trivial post-processing step. In webpage classification, however, we face a huge number of webpages, usually with a highly imbalanced class distribution. Because of budget constraints, a reliable estimate of the threshold must be obtained from only a small set of human-judged webpages. A good threshold selection criterion must also be adopted for highly imbalanced class distributions, where positives are very sparse in the sample set. These challenges make threshold selection a non-trivial task for webpage classification. In this paper, we propose a novel, cost-efficient threshold selection method for binary webpage classification with highly imbalanced class distribution. We construct a small sample set by applying stratified sampling to the webpages. The human-judged samples are expanded to reflect the true class distribution of the webpage population. Experimental results show that the false positive rate leads to a more stable threshold estimate.
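The abstract does not spell out the procedure, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes classifier scores in [0, 1], equal-width score bins as strata, expansion weights equal to stratum size divided by the number of judged samples drawn from that stratum, and a user-chosen cap on the estimated false positive rate. All function names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(scores, n_strata, per_stratum, rng):
    """Bin items into equal-width score strata and sample from each.

    Returns (index, weight) pairs, where weight = N_h / n_h is the
    expansion weight: how many population items one sample stands for.
    Assumption: every score lies in [0, 1].
    """
    strata = defaultdict(list)
    for i, s in enumerate(scores):
        h = min(int(s * n_strata), n_strata - 1)  # clamp s == 1.0 into last bin
        strata[h].append(i)
    sample = []
    for members in strata.values():
        chosen = rng.sample(members, min(per_stratum, len(members)))
        weight = len(members) / len(chosen)
        sample.extend((i, weight) for i in chosen)
    return sample


def weighted_fpr(judged, threshold):
    """Estimate the population false positive rate at a threshold.

    judged: list of (score, label, weight) with label 1 = positive.
    An item is predicted positive when score >= threshold.
    """
    neg = sum(w for _, y, w in judged if y == 0)
    fp = sum(w for s, y, w in judged if y == 0 and s >= threshold)
    return fp / neg if neg else 0.0


def select_threshold(judged, max_fpr):
    """Return the lowest observed score whose estimated FPR <= max_fpr.

    FPR never increases as the threshold rises, so the first candidate
    that satisfies the cap (in ascending order) is the lowest one.
    Returns infinity (predict nothing positive) if no candidate qualifies.
    """
    for t in sorted({s for s, _, _ in judged}):
        if weighted_fpr(judged, t) <= max_fpr:
            return t
    return float("inf")
```

In use, one would score the full webpage population, call `stratified_sample` to pick the pages to send to human judges, attach the returned weights to the judged labels, and pass the resulting triples to `select_threshold`. Because negatives dominate a skewed population, the weighted false positive rate is estimated from many judged samples, which is one plausible reason a criterion based on it could be less noisy than one that depends on the few positives.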

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

He, X., Zhang, R., Zhou, A. (2013). Threshold Selection for Classification with Skewed Class Distribution. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_37

  • DOI: https://doi.org/10.1007/978-3-642-39527-7_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39526-0

  • Online ISBN: 978-3-642-39527-7

  • eBook Packages: Computer Science, Computer Science (R0)