An Approach to Improve Text Classification Efficiency

Zhou, Shuigeng; Guan, Jihong

doi:10.1007/3-540-45710-0_6

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2435)

Abstract

Text classification is becoming increasingly important with the rapid growth of available on-line information. In this paper, we propose an approach to speed up text classification by pruning the training corpus. An effective algorithm for text corpus pruning is designed. Experiments over a real-world text corpus are carried out, which validate the effectiveness and efficiency of the proposed approach. Our approach is especially suitable for on-line text classification applications.

This work was supported by the Provincial Natural Science Foundation of Hubei of China (No. 2001ABB050) and the Natural Science Foundation of China (NSFC) (No. 60173027).

Author information

Authors and Affiliations

State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, China
Shuigeng Zhou
School of Computer Science, Wuhan University, 430072, Wuhan, China
Jihong Guan

Editor information

Editors and Affiliations

Department of Informatics, Aristotle University, 54006, Thessaloniki, Greece
Yannis Manolopoulos
Department of Computer Science and Engineering, Slovak University of Technology, Ilkovicova 3, 81219, Bratislava, Slovakia
Pavol Návrat

About this paper

Cite this paper

Zhou, S., Guan, J. (2002). An Approach to Improve Text Classification Efficiency. In: Manolopoulos, Y., Návrat, P. (eds) Advances in Databases and Information Systems. ADBIS 2002. Lecture Notes in Computer Science, vol 2435. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45710-0_6

DOI: https://doi.org/10.1007/3-540-45710-0_6
Published: 23 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44138-0
Online ISBN: 978-3-540-45710-7
eBook Packages: Springer Book Archive