An Approach to Improve Text Classification Efficiency

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2435)

Abstract

Text classification is becoming increasingly important with the rapid growth of available on-line information. In this paper, we propose an approach to speed up text classification by pruning the training corpus. An effective algorithm for text corpus pruning is designed. Experiments over a real-world text corpus are carried out, which validate the effectiveness and efficiency of the proposed approach. Our approach is especially suitable for on-line text classification applications.

This work was supported by the Provincial Natural Science Foundation of Hubei of China (No. 2001ABB050) and the Natural Science Foundation of China (NSFC) (No. 60173027).

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, S., Guan, J. (2002). An Approach to Improve Text Classification Efficiency. In: Manolopoulos, Y., Návrat, P. (eds) Advances in Databases and Information Systems. ADBIS 2002. Lecture Notes in Computer Science, vol 2435. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45710-0_6

  • DOI: https://doi.org/10.1007/3-540-45710-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44138-0

  • Online ISBN: 978-3-540-45710-7

  • eBook Packages: Springer Book Archive
