Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages

Steinberger, Ralf; Ehrmann, Maud; Pajzs, Júlia; Ebrahim, Mohamed; Steinberger, Josef; Turchi, Marco

doi:10.1007/978-3-642-40585-3_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8082)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. The questions we ask are: How to adapt extraction patterns to such languages? How to de-inflect extracted named entities? And: Will document categorisation benefit from lemmatising the texts?

Invited talk.

Author information

Authors and Affiliations

European Commission - Joint Research Centre, IPSC-GlobeSec, Ispra, VA, Italy
Ralf Steinberger
Department of Computer Science, Sapienza University of Rome, Rome, Italy
Maud Ehrmann
Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary
Júlia Pajzs
Cognizant SetCon, Munich, Germany
Mohamed Ebrahim
Faculty of Applied Sciences, Department of Computer Science and Engineering, NTIS Centre, University of West Bohemia, Pilsen, Czech Republic
Josef Steinberger
Human Language Technology group, Fondazione Bruno Kessler, Trento, Italy
Marco Turchi

Editor information

Editors and Affiliations

University of West Bohemia, 306 14, Pilsen, Czech Republic
Ivan Habernal & Václav Matoušek

About this paper

Cite this paper

Steinberger, R., Ehrmann, M., Pajzs, J., Ebrahim, M., Steinberger, J., Turchi, M. (2013). Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_3

DOI: https://doi.org/10.1007/978-3-642-40585-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science; Computer Science (R0)
