Abstract
Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method and compare them against other language-independent methods based on single words and bigrams. Testing the new method on the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC), using all descriptors found there, we improve keyword assignment precision from 18 to 29 percent and F-measure from 17.2 to 27.6 when 5 keywords are assigned to a document. Filtering out the top 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent, on average over the 21 languages tested.
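The two quantities the abstract reports, precision and F-measure for a fixed number of assigned keywords, and the effect of dropping the most frequent items, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `drop_top_frequent` and `precision_recall_f1` are hypothetical, the F-measure is assumed to be the balanced F1, and the frequency filter is assumed to rank items by their total count over the whole collection.

```python
from collections import Counter


def drop_top_frequent(docs, n=10):
    """Remove the n most frequent items from every document.

    Assumption: frequency is the total count over the whole collection.
    docs is a list of documents, each a list of items (words,
    bigrams, or collocation segments).
    """
    counts = Counter(item for doc in docs for item in doc)
    top = {item for item, _ in counts.most_common(n)}
    return [[item for item in doc if item not in top] for doc in docs]


def precision_recall_f1(assigned, gold):
    """Score one document: assigned keywords vs. manually assigned descriptors.

    Assumption: F-measure is the balanced F1 (harmonic mean of
    precision and recall).
    """
    assigned_set, gold_set = set(assigned), set(gold)
    hits = len(assigned_set & gold_set)
    precision = hits / len(assigned_set) if assigned_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    docs = [["the", "of", "law"], ["the", "of", "tax"], ["the", "trade"]]
    print(drop_top_frequent(docs, n=2))

    # Five keywords assigned to a document, two of which match the gold set.
    print(precision_recall_f1(["a", "b", "c", "d", "e"], ["a", "b", "x", "y"]))
```

Averaging the per-document scores over a test set, and then over languages, would give figures comparable in kind to those quoted above, although the exact averaging scheme used in the study is not stated in the abstract.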
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Daudaravicius, V. (2010). The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_55
DOI: https://doi.org/10.1007/978-3-642-12116-6_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science (R0)