PLD: A Distillation Algorithm for Misclassified Documents

Chen, Ding-Yi; Li, Xue

doi:10.1007/978-3-540-27772-9_50


Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3129))

Included in the following conference series:

International Conference on Web-Age Information Management


Abstract

We observed that in interactive text classification, users tend to point out only the misclassified documents, not the correctly classified ones. A user is unlikely to be diligent enough to identify every misclassified document, so a classifier must expect that only a small proportion of its errors has been identified. We propose the Prediction-Learning-Distillation (PLD) framework for distilling the misclassified documents. Whenever a user points out an error, PLD learns from the mistake and identifies the same mistake among all other classified documents. PLD then enforces this learning in future classifications. Our experimental results demonstrate that the proposed algorithm can learn from user-identified misclassified documents and then successfully distill the rest.
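
The predict, learn, distill cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the classifier or the update rule, so the perceptron-style `TinyLinearClassifier`, the `pld_round` helper, the labels, and the toy documents are all hypothetical stand-ins chosen only to show the flow of one user correction.

```python
from collections import Counter, defaultdict


class TinyLinearClassifier:
    """Hypothetical stand-in: one bag-of-words weight vector per label."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = {label: defaultdict(float) for label in self.labels}

    def score(self, label, tokens):
        return sum(self.weights[label][t] * c for t, c in Counter(tokens).items())

    def predict(self, tokens):
        # Ties are broken in favour of the label listed first.
        return max(self.labels, key=lambda label: self.score(label, tokens))

    def learn(self, tokens, correct, wrong=None):
        # Mistake-driven update (an assumption, not the paper's rule):
        # promote the correct label and demote the wrongly predicted one.
        for t, c in Counter(tokens).items():
            self.weights[correct][t] += c
            if wrong is not None:
                self.weights[wrong][t] -= c


def pld_round(clf, docs, predictions, doc_id, correct_label):
    """Learn from one user-flagged error, then re-check every other document."""
    clf.learn(docs[doc_id], correct_label, wrong=predictions[doc_id])
    predictions[doc_id] = correct_label
    distilled = {}
    for other_id, tokens in docs.items():
        if other_id == doc_id:
            continue
        new_label = clf.predict(tokens)
        if new_label != predictions[other_id]:
            # The label flipped after learning: likely the same mistake.
            distilled[other_id] = (predictions[other_id], new_label)
            predictions[other_id] = new_label
    return distilled


clf = TinyLinearClassifier(["finance", "sport"])
clf.learn("match goal team".split(), "sport")       # toy seed training data
clf.learn("bank stock market".split(), "finance")

docs = {
    "d1": "team goal win".split(),
    "d2": "stock market fall".split(),
    "d3": "striker transfer fee".split(),
    "d4": "striker injury".split(),
}

# Prediction: classify every document with the current model.
predictions = {doc_id: clf.predict(tokens) for doc_id, tokens in docs.items()}

# Learning and distillation: the user flags only d3 as a misclassified sport story.
distilled = pld_round(clf, docs, predictions, "d3", "sport")
print(distilled)
```

In this toy run the user corrects only `d3`, yet `d4` is relabelled as well because it shares the token that drove the update. That mirrors the idea of finding the same mistake among documents the user never inspected.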



Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Australia
Ding-Yi Chen & Xue Li


Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Northeastern University, Shenyang, Liaoning 110004, China
Guoren Wang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng

About this paper

Cite this paper

Chen, DY., Li, X. (2004). PLD: A Distillation Algorithm for Misclassified Documents. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_50

DOI: https://doi.org/10.1007/978-3-540-27772-9_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive
