Abstract
Text classification is becoming increasingly important as the volume of information available online grows rapidly. In this paper, we propose an approach that speeds up text classification by pruning the training corpus. We design an effective algorithm for text corpus pruning. Experiments over a real-world text corpus validate the effectiveness and efficiency of the proposed approach. Our approach is especially suitable for online text classification applications.
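To make the idea of corpus pruning concrete, the following is a minimal sketch only. The abstract does not specify the pruning rule or the classifier, so everything here is an assumption chosen for illustration: a nearest-neighbour classifier over bag-of-words vectors, a hypothetical redundancy rule (`prune`) that drops a training document when an already-kept document of the same class is very similar to it, and a toy six-document corpus. It is not the algorithm proposed in the paper.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(corpus, query_vec, k=1):
    """Majority vote among the k training documents most similar to the query."""
    ranked = sorted(corpus, key=lambda d: cosine(d[0], query_vec), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def prune(corpus, threshold=0.7):
    """Hypothetical pruning rule (not from the paper): skip a document if a
    kept document with the same label is at least `threshold` similar to it.
    The first document of each class is always kept."""
    kept = []
    for vec, label in corpus:
        redundant = any(
            kept_label == label and cosine(kept_vec, vec) >= threshold
            for kept_vec, kept_label in kept
        )
        if not redundant:
            kept.append((vec, label))
    return kept


# Toy training corpus, invented for illustration.
train = [
    ("stock market shares rise", "finance"),
    ("stock market shares fall", "finance"),
    ("bank interest rates loan", "finance"),
    ("football match goal score", "sport"),
    ("football match goal win", "sport"),
    ("tennis player wins tournament", "sport"),
]
corpus = [(vectorize(text), label) for text, label in train]
pruned = prune(corpus)

query = vectorize("shares fall on the stock market")
print(len(corpus), len(pruned))
print(knn_predict(corpus, query), knn_predict(pruned, query))
```

Because a nearest-neighbour classifier compares each query against every stored document, classification cost in this sketch grows with corpus size, so a smaller corpus means fewer similarity computations per query. Whether accuracy is preserved depends on the pruning rule and the data, which is the question the paper's experiments address.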
This work was supported by the Provincial Natural Science Foundation of Hubei of China (No. 2001ABB050) and the Natural Science Foundation of China (NSFC) (No. 60173027).
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, S., Guan, J. (2002). An Approach to Improve Text Classification Efficiency. In: Manolopoulos, Y., Návrat, P. (eds) Advances in Databases and Information Systems. ADBIS 2002. Lecture Notes in Computer Science, vol 2435. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45710-0_6
DOI: https://doi.org/10.1007/3-540-45710-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44138-0
Online ISBN: 978-3-540-45710-7
eBook Packages: Springer Book Archive