Vocabulary Completion Through Word Cooccurrence Analysis Using Unlabeled Documents for Text Categorization

Réhel, Simon; Mineau, Guy W.

doi:10.1007/11424918_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3501)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

Automated text categorization consists of developing computer programs that autonomously assign texts to predefined categories on the basis of their content. Such applications rely on supervised learning, which requires training on manually labeled documents. During this phase, the system discovers links between relevant terms (the vocabulary) and the identified categories. However, constructing a training set is slow and expensive. This paper suggests a way to assist text classifiers in gathering the vocabulary when the number of training examples is limited, in which case the success rate suffers. It proposes analyzing word cooccurrence within a collection of unlabeled documents in order to augment the vocabulary used by the classifier. The representation of new documents to classify would benefit from this augmented vocabulary. The expected result is an improvement in the classifier’s success rate despite its limited training set.
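
The general idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how cooccurrence is measured or how candidate words are selected, so everything below is an assumption made for illustration: cooccurrence is counted at the document level, the Dice coefficient is used as the association score, and the names `cooccurrence_counts`, `augment_vocabulary`, and `min_score` are hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count document frequency of each word and of each word pair."""
    doc_freq = Counter()
    pair_freq = Counter()
    for doc in docs:
        # Each word counts once per document; sorting makes pairs canonical.
        words = sorted(set(doc.lower().split()))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))
    return doc_freq, pair_freq


def augment_vocabulary(vocabulary, unlabeled_docs, min_score=0.5):
    """Add words that cooccur strongly with known vocabulary words.

    The score is the Dice coefficient 2*df(a,b) / (df(a) + df(b)).
    This measure and the threshold are assumptions of this sketch,
    not the method described in the paper.
    """
    doc_freq, pair_freq = cooccurrence_counts(unlabeled_docs)
    added = {}
    for (a, b), n_ab in pair_freq.items():
        score = 2 * n_ab / (doc_freq[a] + doc_freq[b])
        if score < min_score:
            continue
        # A pair contributes only if exactly one side is already known.
        for known, new in ((a, b), (b, a)):
            if known in vocabulary and new not in vocabulary:
                added[new] = max(score, added.get(new, 0.0))
    return set(vocabulary) | set(added), added


# Toy unlabeled collection (invented data, for illustration only).
unlabeled = [
    "stock market shares rise",
    "stock shares fall",
    "football match goal",
    "football goal keeper",
]
augmented, added = augment_vocabulary({"stock", "football"}, unlabeled, min_score=0.8)
print(sorted(augmented))  # → ['football', 'goal', 'shares', 'stock']
```

In this toy run, "shares" and "goal" each appear in every unlabeled document that contains a known vocabulary word, so they pass the threshold, while words that cooccur only once with a known word do not. A classifier trained on few labeled examples could then represent new documents over the augmented set.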

Author information

Authors and Affiliations

Department of Computer Science, Université Laval, Québec, Canada
Simon Réhel & Guy W. Mineau

Editor information

Editors and Affiliations

Département d’informatique et de recherche opérationnelle, CP 6128 succ. Centre-Ville, Université de Montréal, H3C 3J7, Montréal, Canada
Balázs Kégl
Département d’informatique et de recherche opérationnelle, Université de Montréal
Guy Lapalme

About this paper

Cite this paper

Réhel, S., Mineau, G.W. (2005). Vocabulary Completion Through Word Cooccurrence Analysis Using Unlabeled Documents for Text Categorization. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_39

DOI: https://doi.org/10.1007/11424918_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science, Computer Science (R0)
