Reduction of Training Noises for Text Classifiers

Liu, Rey-Long

doi:10.1007/978-3-642-36543-0_4

Rey-Long Liu

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7803)

Abstract

Automatic text classification (TC) is essential for archiving and retrieving texts, which are a main way of recording information and expertise. Many text classifiers have therefore been developed. They are typically built from training texts and have been shown to perform well in various application domains. In practice, however, training texts are often unsound or incomplete, and they often contain many terms unrelated to the categories of interest. Such terms act as training noise and can degrade the performance of the classifiers. Reducing training noise is thus essential, and it is challenging precisely because the training texts are unsound or incomplete. In this paper, we develop a technique, TNR (Training Noise Reduction), that removes possible training noise so that the performance of the classifiers can be further improved. Given a training text d of a category c, TNR identifies a sequence of consecutive terms in d as noise if the terms are not strongly related to c. A case study on the classification of Chinese texts of disease information shows that TNR can improve a Support Vector Machine (SVM) classifier, which is a state-of-the-art classifier in TC. The contribution is significant for the further enhancement of existing text classifiers.
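The idea of flagging runs of consecutive, weakly related terms can be illustrated with a minimal sketch. It is not the TNR algorithm itself, whose details are not given in the abstract. The relatedness score (a smoothed share of a term's document frequency that falls inside the category), the names `term_strength` and `remove_noise`, the parameters `threshold` and `min_run`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def term_strength(train, category):
    """Hypothetical relatedness score, NOT the paper's measure:
    smoothed fraction of the documents containing a term that belong to `category`."""
    in_c, total = Counter(), Counter()
    for terms, label in train:
        for t in set(terms):          # count each term once per document
            total[t] += 1
            if label == category:
                in_c[t] += 1
    return {t: (in_c[t] + 1) / (total[t] + 2) for t in total}

def remove_noise(terms, strength, threshold=0.6, min_run=2):
    """Drop every run of at least `min_run` consecutive terms whose score is
    below `threshold`; shorter runs of weak terms are kept (assumed rule)."""
    kept, run = [], []
    for t in terms:
        if strength.get(t, 0.0) < threshold:
            run.append(t)             # extend the current run of weak terms
        else:
            if len(run) < min_run:    # run too short to count as noise
                kept.extend(run)
            run = []
            kept.append(t)
    if len(run) < min_run:            # handle a run that ends the text
        kept.extend(run)
    return kept

# Toy training set (invented): (terms of text d, category c)
train = [
    (["fever", "cough", "click", "here"], "flu"),
    (["fever", "virus"], "flu"),
    (["click", "here", "stock", "price"], "finance"),
]
strength = term_strength(train, "flu")
cleaned = remove_noise(train[0][0], strength)
print(cleaned)  # ['fever', 'cough']
```

In this sketch the cleaned term sequences, rather than the original texts, would be passed to the classifier trainer. Requiring a minimum run length is one way to avoid discarding an isolated weak term that sits between strongly related ones.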

Author information

Authors and Affiliations

Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
Rey-Long Liu

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Systems, Department of Software Engineering, Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia
Ali Selamat & Habibollah Haron
Institute of Informatics, Division of Knowledge Management Systems, Wrocław University of Technology, Str. Wybrzeże Wyspiańskiego 27, 50-370, Wrocław, Poland
Ngoc Thanh Nguyen

About this paper

Cite this paper

Liu, RL. (2013). Reduction of Training Noises for Text Classifiers. In: Selamat, A., Nguyen, N.T., Haron, H. (eds) Intelligent Information and Database Systems. ACIIDS 2013. Lecture Notes in Computer Science, vol 7803. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36543-0_4

DOI: https://doi.org/10.1007/978-3-642-36543-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36542-3
Online ISBN: 978-3-642-36543-0
eBook Packages: Computer Science, Computer Science (R0)
