Concepticons vs. lexicons: An architecture for multilingual information extraction

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1299)

Abstract

Given an information extraction (IE) system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in a different language. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of different languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending the system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language. In this paper we describe how the architecture of the LaSIE system, an IE system designed to do monolingual extraction from English texts, has been modified to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments with extending the system to French are described.
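The separation described in the abstract can be illustrated with a minimal sketch. This is not the LaSIE implementation: the class names `Concepticon` and `Lexicon`, the `extract` function, and the toy vocabulary are all hypothetical, chosen only to show a single language-independent concept hierarchy being reused with interchangeable per-language lexicons.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concepticon:
    """Hypothetical language-independent concept hierarchy (child -> parent)."""
    parents: dict[str, str] = field(default_factory=dict)

    def is_a(self, concept: str | None, ancestor: str) -> bool:
        # Walk up the hierarchy until the ancestor is found or the root is passed.
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False


@dataclass
class Lexicon:
    """Hypothetical language-dependent mapping from words to concept names."""
    language: str
    word_to_concept: dict[str, str]

    def concept_for(self, word: str) -> str | None:
        return self.word_to_concept.get(word.lower())


def extract(tokens: list[str], lexicon: Lexicon,
            concepticon: Concepticon, target: str) -> list[str]:
    """Return the concepts in `tokens` that fall under the `target` concept."""
    found = []
    for token in tokens:
        concept = lexicon.concept_for(token)
        if concept is not None and concepticon.is_a(concept, target):
            found.append(concept)
    return found


# Conceptual knowledge is written once (toy data, assumed for illustration).
concepticon = Concepticon({
    "company": "organization",
    "organization": "entity",
    "person": "entity",
})

# Lexical knowledge is supplied per language (toy data, assumed for illustration).
english = Lexicon("en", {"firm": "company", "company": "company", "manager": "person"})
french = Lexicon("fr", {"société": "company", "entreprise": "company", "directeur": "person"})

en_result = extract("The firm hired a manager".split(), english, concepticon, "organization")
fr_result = extract("La société engage un directeur".split(), french, concepticon, "organization")
```

In this sketch, adding a language means supplying only a new `Lexicon`; the `Concepticon` and the extraction logic are unchanged, and results are expressed as concept names rather than source-language words.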


Editor information

Maria Teresa Pazienza

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gaizauskas, R., Humphreys, K., Azzam, S., Wilks, Y. (1997). Concepticons vs. lexicons: An architecture for multilingual information extraction. In: Pazienza, M.T. (eds) Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. SCIE 1997. Lecture Notes in Computer Science, vol 1299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63438-X_3

  • DOI: https://doi.org/10.1007/3-540-63438-X_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63438-6

  • Online ISBN: 978-3-540-69548-6

  • eBook Packages: Springer Book Archive
