Abstract
In document classification, threshold selection has received little attention, particularly in binary classification, where it has largely been dismissed as a trivial post-processing step. Webpage classification, however, involves a huge number of webpages, usually with a highly imbalanced class distribution. Because of budget constraints, the threshold must be estimated reliably from only a small set of human-judged webpages. The threshold selection criterion must also work well when the class distribution is highly imbalanced and positives are very sparse in the sample set. These challenges make threshold selection a non-trivial task for webpage classification. In this paper, we propose a novel, cost-efficient threshold selection method for binary webpage classification with a highly imbalanced class distribution. We construct a small sample set by applying stratified sampling to the webpages. The human-judged samples are then expanded to reflect the true class distribution of the webpage population. Experimental results show that the false positive rate leads to a more stable threshold estimate.
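The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea under stated assumptions: strata are equal-width bins over classifier scores in [0, 1], the "expansion" is modeled as weighting each judged sample by the inverse of its stratum's sampling fraction, and the threshold is the lowest score whose estimated false positive rate stays within a hypothetical `max_fpr` budget. All function and parameter names are illustrative and are not taken from the paper.

```python
import random
from collections import defaultdict

def stratified_sample(scores, n_strata, per_stratum, rng):
    """Bin pages by score (assumed equal-width strata) and sample each bin.

    Returns (index, weight) pairs, where weight = bin population / bin sample
    size, so the weighted sample approximates the full population.
    """
    strata = defaultdict(list)
    for i, s in enumerate(scores):
        strata[min(int(s * n_strata), n_strata - 1)].append(i)
    sample = []
    for members in strata.values():
        chosen = rng.sample(members, min(per_stratum, len(members)))
        weight = len(members) / len(chosen)
        sample.extend((i, weight) for i in chosen)
    return sample

def weighted_fpr(sample, scores, labels, threshold):
    """Estimate population FPR: weighted negatives scored at or above the
    threshold, divided by all weighted negatives. Only sampled labels are read."""
    neg = sum(w for i, w in sample if labels[i] == 0)
    fp = sum(w for i, w in sample
             if labels[i] == 0 and scores[i] >= threshold)
    return fp / neg if neg else 0.0

def select_threshold(sample, scores, labels, max_fpr):
    """Return the lowest sampled score whose estimated FPR is within max_fpr
    (hypothetical criterion). FPR is non-increasing in the threshold, so the
    first qualifying candidate in ascending order is the lowest one."""
    for t in sorted({scores[i] for i, _ in sample}):
        if weighted_fpr(sample, scores, labels, t) <= max_fpr:
            return t
    return float("inf")  # no candidate qualifies: predict no positives
```

In use, `labels` would hold human judgments for the sampled pages only, which is why the functions never read a label outside the sample. The weighting matters because a sample drawn evenly across score bins over-represents the rare high-scoring pages relative to the population.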
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
He, X., Zhang, R., Zhou, A. (2013). Threshold Selection for Classification with Skewed Class Distribution. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_37
DOI: https://doi.org/10.1007/978-3-642-39527-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39526-0
Online ISBN: 978-3-642-39527-7
eBook Packages: Computer Science, Computer Science (R0)