Abstract
This paper presents an algorithm for extracting phrases from text documents. The algorithm builds phrases by iteratively merging bigrams according to an association measure. Two association measures are considered: mutual information and the t-test. The extracted phrases are evaluated in a document classification task using a tf/idf model and a k-nearest neighbor classifier.
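The iterative merging step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the count and score thresholds, the greedy one-merge-per-pass strategy, and the use of "_" to join merged tokens are all assumptions made for illustration. The two scoring functions follow the standard textbook definitions of pointwise mutual information and the collocation t statistic, with probabilities estimated as counts divided by the number of tokens.

```python
import math
from collections import Counter


def mutual_information(c_xy, c_x, c_y, n):
    """Pointwise mutual information: log2(P(x,y) / (P(x) * P(y)))."""
    return math.log2(c_xy * n / (c_x * c_y))


def t_score(c_xy, c_x, c_y, n):
    """t statistic for a bigram, approximating the variance by the sample mean."""
    mean = c_xy / n
    expected = (c_x / n) * (c_y / n)
    return (mean - expected) / math.sqrt(mean / n)


def merge_once(tokens, measure, min_count, min_score):
    """Merge every occurrence of the best-scoring bigram into one token.

    Returns (new_tokens, merged_phrase), or (tokens, None) if nothing qualifies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {
        pair: measure(count, unigrams[pair[0]], unigrams[pair[1]], n)
        for pair, count in bigrams.items()
        if count >= min_count
    }
    if not scored:
        return tokens, None
    best = max(scored, key=scored.get)
    if scored[best] < min_score:
        return tokens, None
    phrase = "_".join(best)  # assumed convention for naming a merged token
    merged, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == best:
            merged.append(phrase)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, phrase


def extract_phrases(tokens, measure=mutual_information,
                    min_count=2, min_score=0.0, max_merges=100):
    """Repeat merging until no bigram passes the (assumed) thresholds."""
    phrases = []
    for _ in range(max_merges):
        tokens, phrase = merge_once(tokens, measure, min_count, min_score)
        if phrase is None:
            break
        phrases.append(phrase)
    return tokens, phrases


if __name__ == "__main__":
    text = "new york is big and new york was busy".split()
    print(extract_phrases(text))
```

Because a merged token re-enters the token stream, later passes can combine it with a neighbour, so longer phrases are built up from shorter ones. Passing `measure=t_score` swaps the association measure without changing the merging loop.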
This work was partially supported by NSERC strategic grant and TL-NCE.
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bakus, J., Kamel, M., Carey, T. (2002). Extraction of Text Phrases Using Hierarchical Grammar. In: Cohen, R., Spencer, B. (eds) Advances in Artificial Intelligence. Canadian AI 2002. Lecture Notes in Computer Science, vol 2338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47922-8_27
DOI: https://doi.org/10.1007/3-540-47922-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43724-6
Online ISBN: 978-3-540-47922-2
eBook Packages: Springer Book Archive