Developing Multilingual Text Mining Workflows in UIMA and U-Compare

Kontonasios, Georgios; Korkontzelos, Ioannis; Ananiadou, Sophia

doi:10.1007/978-3-642-31178-9_8

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7337)

Abstract

We present a generic, language-independent method for constructing multilingual text mining workflows. The proposed mechanism is implemented as an extension of U-Compare, a platform built on top of the Unstructured Information Management Architecture (UIMA) that allows the construction, comparison and evaluation of interoperable text mining workflows. UIMA previously supported strictly monolingual workflows. Building multilingual workflows poses challenging problems, such as representing multilingual document collections and executing language-dependent components in parallel. As an application of our method, we develop a multilingual workflow that extracts terms from a parallel collection using a new heuristic. For our experiments, we construct a parallel corpus of approximately 188,000 PubMed article titles in French and English. Our application is evaluated against a popular monolingual term extraction method, C-value.
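
For readers unfamiliar with the baseline, the following is a minimal sketch of the C-value score in its commonly stated closed form: a candidate term's frequency is discounted by the mean frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the term's length in words. This is not the authors' implementation. The function names, the representation of candidates as tuples of tokens, and the toy frequencies are assumptions made for illustration, and the sketch omits the linguistic filtering and the incremental nested-frequency bookkeeping of the full method.

```python
import math


def is_nested(short, long):
    """Return True if token tuple `short` occurs contiguously inside the longer tuple `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def c_value(freqs):
    """Score candidate terms with the closed-form C-value.

    freqs: dict mapping a candidate term (tuple of tokens) to its corpus frequency.
    Returns a dict mapping each candidate to its score.
    Single-word candidates score 0 because log2(1) == 0.
    """
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of longer candidates that contain this term.
        containers = [f for other, f in freqs.items() if is_nested(term, other)]
        if containers:
            # Discount by the mean frequency of the containing candidates.
            freq = freq - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * freq
    return scores


# Toy frequencies (hypothetical, not taken from the paper's corpus).
example = {
    ("soft", "contact", "lens"): 5,
    ("contact", "lens"): 8,
}
scores = c_value(example)
print(scores[("contact", "lens")])  # 3.0: (8 - 5) * log2(2)
```

In this toy input, "contact lens" appears 8 times, 5 of which are inside "soft contact lens", so only 3 occurrences count as independent evidence that it is a term in its own right.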

Author information

Authors and Affiliations

National Centre for Text Mining, School of Computer Science, The University of Manchester, UK
Georgios Kontonasios, Ioannis Korkontzelos & Sophia Ananiadou

Editor information

Editors and Affiliations

Information Science Department, University of Groningen, Oude Kijk in ’t Jatstraat 26, 9712 EK, Groningen, The Netherlands
Gosse Bouma
Faculty of Economics and Business, University of Groningen, Nettelbosje 2, 9747 AE, Groningen, The Netherlands
Ashwin Ittoo & Hans Wortmann
CNAM-Laboratoire Cédric, 292 rue St. Martin, 75141, Paris Cedex 03, France
Elisabeth Métais

About this paper

Cite this paper

Kontonasios, G., Korkontzelos, I., Ananiadou, S. (2012). Developing Multilingual Text Mining Workflows in UIMA and U-Compare. In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds) Natural Language Processing and Information Systems. NLDB 2012. Lecture Notes in Computer Science, vol 7337. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31178-9_8

DOI: https://doi.org/10.1007/978-3-642-31178-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31177-2
Online ISBN: 978-3-642-31178-9
eBook Packages: Computer Science, Computer Science (R0)