Abstract
As the number of published documents increases rapidly, there is a crucial need for fast and sensitive categorization methods to manage the information produced. In this paper, we focus on the categorization of biomedical documents with concepts of the Gene Ontology, an ontology dedicated to gene description. Our approach discovers associations between the predefined concepts and the documents using string matching techniques. The assignments are ranked according to a score computed using several strategies, and the effects of these scoring strategies on categorization effectiveness are evaluated. In particular, a new weighting technique based on term frequency is presented. This new weighting technique improves categorization effectiveness in most of the experiments performed. This paper shows that a clever use of term frequency can bring substantial benefits when performing automatic categorization on large collections of documents.
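To make the general idea concrete, the following is a minimal sketch of string-matching categorization with a frequency-based score. It is not the weighting technique of the paper, which the abstract does not specify. The function names (`tokenize`, `score_concepts`), the scoring formula (label coverage multiplied by summed term frequency), and the example document are all assumptions chosen purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_concepts(document, concepts):
    """Rank concepts by an illustrative frequency-based score.

    Assumption (not from the paper): the score of a concept is the
    fraction of its label words found in the document, multiplied by
    the total number of occurrences of those words.
    """
    counts = Counter(tokenize(document))
    ranked = []
    for concept_id, label in concepts.items():
        words = tokenize(label)
        matched = [w for w in words if counts[w] > 0]
        if not matched:
            continue  # no string match, so no assignment is proposed
        coverage = len(matched) / len(words)
        frequency = sum(counts[w] for w in matched)
        ranked.append((concept_id, coverage * frequency))
    # Highest score first
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical example: three Gene Ontology concept labels and a toy document.
concepts = {
    "GO:0005515": "protein binding",
    "GO:0004672": "protein kinase activity",
    "GO:0006412": "translation",
}
document = (
    "Protein binding was measured. Binding of the protein increased "
    "after binding site mutation, as did kinase activity."
)
ranked = score_concepts(document, concepts)
for concept_id, score in ranked:
    print(concept_id, score)
```

In this sketch a concept whose label words never occur in the document receives no assignment, and the remaining concepts are ordered by score, mirroring the match-then-rank structure described in the abstract.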
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ehrler, F., Ruch, P. (2007). Unsupervised Documents Categorization Using New Threshold-Sensitive Weighting Technique. In: Bellazzi, R., Abu-Hanna, A., Hunter, J. (eds) Artificial Intelligence in Medicine. AIME 2007. Lecture Notes in Computer Science, vol 4594. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73599-1_33
DOI: https://doi.org/10.1007/978-3-540-73599-1_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73598-4
Online ISBN: 978-3-540-73599-1
eBook Packages: Computer Science, Computer Science (R0)