AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5446)

Abstract

Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods usually take single words as terms, since this performs well on several well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this does not always hold when categorizing texts about similar topics in the same domain. With words alone, such texts cannot be categorized effectively, since they share nearly the same word set. We therefore hypothesize that the results can be improved by also using phrases as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a data set of paper abstracts from the field of databases. Building on this, we also develop a system called AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts.
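
The central observation, that texts on similar topics can share nearly the same word set yet differ in their phrases, can be illustrated with a minimal sketch. This is not the phrase extraction method proposed in the paper, which the abstract does not describe; it simply assumes that a "phrase" is a contiguous word bigram. The function names (`extract_terms`, `jaccard`) and the two example texts are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def extract_terms(text, max_n=1):
    """Return the set of contiguous word n-grams for n = 1..max_n.

    max_n=1 yields words only; max_n=2 adds bigrams as stand-in phrases
    (an assumption for illustration, not the paper's extraction method).
    """
    tokens = tokenize(text)
    terms = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.add(" ".join(tokens[i:i + n]))
    return terms


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b)


# Two hypothetical texts from the same domain with an identical word set.
doc_a = "index selection for query optimization"
doc_b = "query selection for index optimization"

words_sim = jaccard(extract_terms(doc_a, 1), extract_terms(doc_b, 1))
phrases_sim = jaccard(extract_terms(doc_a, 2), extract_terms(doc_b, 2))

print(f"words only:      {words_sim:.2f}")
print(f"words + bigrams: {phrases_sim:.2f}")
```

With words alone the two texts are indistinguishable (similarity 1.00), while adding bigrams lowers the similarity to 0.50, which gives a classifier features that separate them.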

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, Z. et al. (2009). AutoPCS: A Phrase-Based Text Categorization System for Similar Texts. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_33

  • DOI: https://doi.org/10.1007/978-3-642-00672-2_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00671-5

  • Online ISBN: 978-3-642-00672-2

  • eBook Packages: Computer Science, Computer Science (R0)
