Abstract
This paper presents a Bayes document classifier that uses phrases as features. The phrases are extracted by a grammar whose rules are applied iteratively to the sequence of words in a document. The grammar is generated from a training set using statistical word association. We report an improvement in classification over the “bag of words” representation.
This work was supported in part by an NSERC strategic grant and TL-NCE.
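The abstract does not specify the grammar or the association measure, so the following is only a minimal sketch of the general idea, under stated assumptions: merge rules are learned from a tokenized training set by scoring adjacent token pairs with pointwise mutual information, and the rules are then applied round by round so that earlier merges can feed later ones. The function names, the PMI score, the `threshold` and `min_count` parameters, and the underscore-joined phrase tokens are all illustrative choices, not the authors' method.

```python
import math
from collections import Counter


def apply_rules(tokens, rule_set):
    """Greedily merge adjacent pairs found in rule_set, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in rule_set:
            out.append(tokens[i] + "_" + tokens[i + 1])  # hypothetical phrase token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merge_rules(docs, threshold=1.0, min_count=2, max_iters=3):
    """Learn one set of merge rules per round from tokenized training documents.

    Assumption: association is measured with pointwise mutual information (PMI)
    between adjacent tokens; the paper's actual measure may differ.
    """
    docs = [list(d) for d in docs]
    rounds = []
    for _ in range(max_iters):
        unigrams = Counter(tok for d in docs for tok in d)
        bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
        n_uni = sum(unigrams.values())
        n_bi = sum(bigrams.values())
        rule_set = set()
        for (a, b), count in bigrams.items():
            if count < min_count:
                continue  # skip rare pairs, whose PMI estimate is unreliable
            p_ab = count / n_bi
            p_a = unigrams[a] / n_uni
            p_b = unigrams[b] / n_uni
            if math.log2(p_ab / (p_a * p_b)) >= threshold:
                rule_set.add((a, b))
        if not rule_set:
            break  # no pair is associated strongly enough; stop iterating
        rounds.append(rule_set)
        # Rewrite the training set so the next round can extend merged phrases.
        docs = [apply_rules(d, rule_set) for d in docs]
    return rounds


def extract_phrases(tokens, rounds):
    """Apply the learned rule sets in order to a new tokenized document."""
    for rule_set in rounds:
        tokens = apply_rules(tokens, rule_set)
    return tokens
```

Calling `learn_merge_rules` on a list of token lists returns the per-round rule sets, and `extract_phrases` rewrites a new document with them. The resulting tokens, single words and merged phrases alike, could then be counted as features for a Bayes classifier in place of the plain “bag of words”.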
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bakus, J., Kamel, M. (2002). Document Classification Using Phrases. In: Caelli, T., Amin, A., Duin, R.P.W., de Ridder, D., Kamel, M. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2002. Lecture Notes in Computer Science, vol 2396. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-70659-3_58
DOI: https://doi.org/10.1007/3-540-70659-3_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44011-6
Online ISBN: 978-3-540-70659-5
eBook Packages: Springer Book Archive