The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance

Daudaravicius, Vidas

doi:10.1007/978-3-642-12116-6_55

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing them against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC), using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 when 5 keywords are assigned to a document. Averaged over the 21 languages tested, filtering out the top 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent.
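The two quantities the abstract refers to, removal of the 10 most frequent items and precision/F-measure for 5 assigned keywords, can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the token-list input format, and the per-document (rather than corpus-averaged) scoring are assumptions made only for illustration, and the collocation segmentation step itself is not shown.

```python
from collections import Counter


def top_frequent_items(documents, n=10):
    """Return the n most frequent items across a corpus.

    `documents` is assumed to be an iterable of token lists (hypothetical
    input format); items may be single words or collocation segments.
    """
    counts = Counter(item for doc in documents for item in doc)
    return {item for item, _ in counts.most_common(n)}


def filter_items(doc, stop_items):
    """Drop the corpus-wide most frequent items from one document."""
    return [item for item in doc if item not in stop_items]


def precision_recall_f1(assigned, gold):
    """Score one document's assigned keywords against its manual descriptors."""
    assigned, gold = set(assigned), set(gold)
    hits = len(assigned & gold)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy corpus; real experiments would use a much larger collection.
    docs = [
        ["the", "of", "fishing", "quota"],
        ["the", "of", "customs", "tariff"],
    ]
    stop_items = top_frequent_items(docs, n=2)
    print(filter_items(docs[0], stop_items))

    # Five assigned keywords versus three manually assigned descriptors.
    print(precision_recall_f1(["a", "b", "c", "d", "e"], ["a", "b", "x"]))
```

With a fixed number of assigned keywords (5 here), precision is the share of those 5 that match the manual descriptors, so documents with fewer than 5 gold descriptors cannot reach a precision of 1.0 under this scoring.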

Author information

Authors and Affiliations

Vytautas Magnus University, Vileikos 8, Kaunas, Lithuania
Vidas Daudaravicius

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, 07738, Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Daudaravicius, V. (2010). The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_55

DOI: https://doi.org/10.1007/978-3-642-12116-6_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science, Computer Science (R0)
