The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Abstract

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing them against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC), using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Averaged over the 21 languages tested, additionally filtering out the 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent.
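The two measurable steps named in the abstract, dropping the most frequent items and scoring the first 5 assigned keywords, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `drop_top_items` and `precision_recall_f1` are hypothetical, frequency is assumed to be counted over the whole collection, and the F-measure is assumed to be the balanced harmonic mean of precision and recall, none of which the abstract specifies.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Sequence


def drop_top_items(docs: Sequence[Sequence[str]], n: int = 10) -> list[list[str]]:
    """Remove the n most frequent items from every document.

    Assumption: frequency is counted over the whole collection.
    """
    counts = Counter(item for doc in docs for item in doc)
    top = {item for item, _ in counts.most_common(n)}
    return [[item for item in doc if item not in top] for doc in docs]


def precision_recall_f1(
    assigned: Sequence[str], gold: Iterable[str], k: int = 5
) -> tuple[float, float, float]:
    """Score the first k assigned keywords against the manually assigned ones.

    Assumption: F-measure is the balanced harmonic mean of precision and recall.
    """
    gold_set = set(gold)
    top_k = list(assigned[:k])
    hits = sum(1 for keyword in top_k if keyword in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    docs = [["the", "law", "the"], ["the", "tax"]]
    print(drop_top_items(docs, n=1))
    print(precision_recall_f1(["a", "b", "c", "d", "e", "f"], {"a", "c", "x", "y"}))
```

Here `assigned` is expected to be ranked best-first, so that `k=5` corresponds to the "5 keywords assigned to a document" setting reported above.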



Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Daudaravicius, V. (2010). The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_55

  • DOI: https://doi.org/10.1007/978-3-642-12116-6_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12115-9

  • Online ISBN: 978-3-642-12116-6

  • eBook Packages: Computer Science, Computer Science (R0)
