Threshold Selection for Classification with Skewed Class Distribution

He, Xiaofeng; Zhang, Rong; Zhou, Aoying

doi:10.1007/978-3-642-39527-7_37

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7901)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

In document classification, threshold selection has received little attention, particularly in binary classification, where it is largely dismissed as a trivial post-processing step. Webpage classification, however, involves a huge number of webpages, usually with a highly imbalanced class distribution. Because of budget constraints, the threshold must be estimated reliably from only a small set of human-judged webpages. A good threshold selection criterion is also needed when the class distribution is highly imbalanced and positives are very sparse in the sample set. These challenges make threshold selection a non-trivial task for webpage classification. In this paper, we propose a novel, cost-efficient threshold selection method for binary webpage classification with a highly imbalanced class distribution. We construct a small sample set by applying stratified sampling to the webpages. The human-judged samples are then expanded to reflect the true class distribution of the webpage population. Experimental results show that the false positive rate leads to a more stable threshold estimate.
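
The abstract does not give the details of the sampling, the expansion step, or the exact selection criterion, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that strata are equal-width bins of a classifier score in [0, 1], that "expansion" means weighting each judged page by the inverse of its stratum's sampling fraction, and that the threshold is the lowest score whose weighted false positive rate stays under a chosen cap. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(scores, n_bins, per_bin, rng):
    """Bin page indices by score and draw up to per_bin pages from each bin.

    Returns {page_index: weight}, where weight is the number of pages in the
    population that each sampled page stands for (assumed expansion scheme).
    Assumes scores lie in [0, 1] and per_bin >= 1.
    """
    bins = defaultdict(list)
    for i, s in enumerate(scores):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        bins[b].append(i)
    sample = {}
    for members in bins.values():
        chosen = rng.sample(members, min(per_bin, len(members)))
        weight = len(members) / len(chosen)
        for i in chosen:
            sample[i] = weight
    return sample


def weighted_fpr(judged, threshold):
    """judged: list of (score, label, weight) with label 1 = positive.

    Weighted FPR = weight of negatives scored at or above the threshold,
    divided by the total weight of negatives.
    """
    fp = sum(w for s, y, w in judged if y == 0 and s >= threshold)
    neg = sum(w for s, y, w in judged if y == 0)
    return fp / neg if neg else 0.0


def pick_threshold(judged, max_fpr):
    """Return the lowest observed score whose weighted FPR is <= max_fpr.

    FPR does not increase as the threshold rises, so the first candidate
    that satisfies the cap is the lowest one. Returns None if none does.
    """
    for t in sorted({s for s, _, _ in judged}):
        if weighted_fpr(judged, t) <= max_fpr:
            return t
    return None
```

With judged pages such as `[(0.1, 0, 10.0), (0.2, 0, 10.0), (0.6, 0, 2.0), (0.7, 1, 2.0), (0.9, 1, 1.0)]`, the weights let the two low-score negatives count for twenty pages of the population, so a cap on the false positive rate is evaluated against the estimated population rather than the small judged sample.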

Author information

Authors and Affiliations

Software Engineering Institute, East China Normal University, Shanghai, China, 200062
Xiaofeng He, Rong Zhang & Aoying Zhou

Editor information

Editors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, China
Yunjun Gao
Seoul National University, Seoul, Korea
Kyuseok Shim
Institute of Software, Chinese Academy of Sciences, South-Fourth-Street 4, Zhong-Guan-Cun, 100190, Beijing, P.R. China
Zhiming Ding
School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
Peiquan Jin
School of Computer Science and Technology, Hangzhou Dianzi University, 310018, Hangzhou, China
Zujie Ren
Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, 300384, Tianjin, China
Yingyuan Xiao
CityU-USTC Advanced Research Institute, Suzhou, China
An Liu
School of Information Science and Technology, Southwest Jiaotong University, 610031, Chengdu, China
Shaojie Qiao

About this paper

Cite this paper

He, X., Zhang, R., Zhou, A. (2013). Threshold Selection for Classification with Skewed Class Distribution. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_37

DOI: https://doi.org/10.1007/978-3-642-39527-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39526-0
Online ISBN: 978-3-642-39527-7
eBook Packages: Computer Science, Computer Science (R0)
