Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

  • Original Paper
  • Language Resources and Evaluation

Abstract

The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news, currently in fifty languages, and that extract named entities and quotations (reported speech) from texts in twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, so that a new language can be plugged in by providing the language-specific resources for it. We thus describe the types of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications pursued in our efforts include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, as well as the identification of reported speech quotations by and about people.

Notes

  1. See http://www.newsvine.com/. All URLs were last visited in February 2011.

  2. See http://www.silobreaker.com/.

  3. See http://www.daylife.com/.

  4. See http://www.newstin.com/.

  5. See http://emm.newsexplorer.eu/. NewsExplorer processes news articles in Arabic, Bulgarian, Danish, Dutch, English, Estonian, Farsi, French, German, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovene, Spanish, Swahili, Swedish and Turkish.

  6. The reference links to slides of an invited presentation, for which no full paper is available. The slides present a generic approach and various options, without specifying what solution was used for Swahili. No evaluation results are mentioned. Following our request, the first author confirmed that no written publications are available on this work.

  7. See http://www.aakkl.helsinki.fi/cameel/corpus/intro.htm.

  8. See http://www.njas.helsinki.fi/salama/.

  9. See http://www.translate.google.com/.

  10. See http://www.kamusiproject.org/.

  11. See http://www.en.wikipedia.org/wiki/Swahili_language.

  12. The list of news sources regularly monitored by EMM is the following: BBC Swahili, Deutsche Welle, FM Free Media, Habari Leo, Idhaa ya Redio ya UM, Inter Press Service-Africa, IPP Media, IRIB Radio, Mwananchi, New Habari, Nifahamishe, Raia Mwema, VOA News-Sauti ya Amerika and Worldnews Swahili. Some of these media sources do not regularly produce news output. For an up-to-date list of news sources, go to ‘Advanced Search’ on http://emm.newsbrief.eu/ and select the language.

  13. The online dictionaries used are the Kamusi Project-Internet Living Swahili Dictionary (ILSD) (http://www.kamusiproject.org/), the TshwaneDJe dictionary (http://www.africanlanguages.com/swahili/) and the Freedict Swahili-English dictionary (http://www.freedict.com/onldict/swa.html), as well as the Wikipedia encyclopaedia.

  14. The uppercase requirement is not applied to languages using the Arabic script, as this script does not distinguish between uppercase and lowercase. For further language-specific exceptions, see Steinberger et al. (2008a).

  15. Both accessible via http://www.emm.newsbrief.eu/overview.html, then select the language ‘Sw’.

References

  • Bering, C., Drożdżyński, W., Erbach, G., Guasch, L., Homola, P., Lehmann, S., et al. (2003). Corpora and evaluation tools for multilingual named entity grammar development. In Proceedings of the multilingual corpora workshop at corpus linguistics (pp. 42–52). Lancaster, UK.

  • Carenini, M., Whyte, A., Bertorello, L., & Vanocchi, M. (2007). Improving communication in E-democracy using natural language processing. IEEE Intelligent Systems, 22(1), 20–27.

  • De Pauw, G., & de Schryver, G.-M. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18, 303–318.

  • De Pauw, G., de Schryver, G.-M., & Wagacha, P. W. (2006). Data-driven part-of-speech tagging of Kiswahili. In Text, speech and dialogue (Vol. 4188, pp. 197–204). Berlin: Springer.

  • De Pauw, G., de Schryver, G.-M., & Wagacha, P. W. (2009). A corpus-based survey of four electronic Swahili–English bilingual dictionaries. Lexikos, 19, 340–352.

  • De Pauw, G., Wagacha, P., & de Schryver, G.-M. (2011). Exploring the SAWA corpus—Collection and deployment of a parallel corpus English—Swahili. Language Resources and Evaluation Journal. Special Issue on African Language Technology, Springer.

  • Gamon, M., Lozano, C., Pinkham, J., & Reutter, T. (1997). Practical experience with grammar sharing in multilingual NLP. In Proceedings of ACL/EACL (pp. 49–56). Madrid, Spain.

  • Ignat, C., Pouliquen, B., Ribeiro, A., & Steinberger, R. (2003). Extending an information extraction tool set to central and eastern European languages. In Proceedings of the workshop information extraction for slavonic and other central and eastern European languages (IESL’2003) (pp. 33–39). Borovets, Bulgaria, 8–9 Sep 2003.

  • Landauer, T., & Littman, M. (1991). A statistical method for language-independent representation of the topical content of text segments. In 11th International conference expert systems and their applications (Vol. 8, pp. 77–85), Avignon, France.

  • Leek, T., Jin, H., Sista, S., & Schwartz, R. (1999). The BBN crosslingual topic detection and tracking system. In 1999 TDT evaluation system summary papers (pp. 214–221). Vienna, VA, USA.

  • Manny, R., & Bouillon, P. (1996). Adapting the core language engine to French and Spanish. In Proceedings of the international conference NLP+IA (pp. 224–232). Moncton, Canada.

  • Maynard, D., Tablan, V., Cunningham, H., Ursu, C., Saggion, H., Bontcheva, K., & Wilks, Y. (2002). Architectural elements of language engineering robustness. Natural Language Engineering, 8(3), 257–274. Special Issue on Robust Methods in Analysis of Natural Language Data.

  • Ng’ang’a, W. (2005). Word sense disambiguation of Swahili: Extending Swahili language technology with machine learning. Ph.D. thesis, Helsinki University.

  • Och, F., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Pastra, K., Maynard, D., Hamza, O., Cunningham, H., & Wilks, Y. (2002). How feasible is the reuse of grammars for Named Entity Recognition? In Proceedings of LREC (pp. 1412–1418). Las Palmas, Spain.

  • Pouliquen, B., Kimler, M., Steinberger, R., Ignat, C., Oellinger, T., Blackler, K., et al. (2006). Geocoding multilingual texts: Recognition, disambiguation and visualisation. In Proceedings of LREC’2006 (pp. 53–58). Genoa, Italy, 24–26 May 2006.

  • Pouliquen, B., & Steinberger, R. (2009). Automatic construction of multilingual name dictionaries. In C. Goutte, N. Cancedda, M. Dymetman & G. Foster (Eds.), Learning machine translation (pp. 59–78). Cambridge: MIT Press—Advances in Neural Information Processing Systems Series (NIPS).

  • Pouliquen, B., Steinberger, R., & Best, C. (2007). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in natural language processing (RANLP’2007) (pp. 487–492). Borovets, Bulgaria, 27–29 Sep 2007.

  • Shah, R., Lin, B., Gershman, A., & Frederking, R. (2010). SYNERGY: A named entity recognition system for resource-scarce languages such as Swahili using online machine translation. In Proceedings of the second workshop on African language technology (AfLAT), Malta, 9 July 2010.

  • Sproat, R., Roth, D., Zhai, C., Benmamoun, E., Fister, A., Karlinsky, N., et al. (2005). Named entity recognition and transliteration for 50 languages. Keynote address at the second midwest computational linguistics colloquium, 14–15 May 2005, The Ohio State University.

  • Steinberger, R. (2011). A survey of methods to ease the development of highly multilingual text mining applications. Language Resources and Evaluation Journal, Special issue on LREC’2010.

  • Steinberger, R., Fuart, F., van der Goot, E., Best, C., von Etter, P., & Yangarber, R. (2008b). Text mining from the web for medical intelligence. In F. Fogelman-Soulié, D. Perrotta, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 295–310). Amsterdam, The Netherlands: IOS Press.

  • Steinberger, R., Pouliquen, B., & Ignat, C. (2008a). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, D. Perrotta, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam, The Netherlands: IOS Press.

  • Steinberger, R., Pouliquen, B., & van der Goot, E. (2009). An introduction to the Europe Media Monitor family of applications. In F. Gey, N. Kando, & J. Karlgren (Eds.), Information access in a multilingual world. Proceedings of SIGIR-CLIR (pp. 1–8). Boston, USA, 23 July 2009.

  • Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. Advances in Neural Information Processing Systems, 15, 1473–1480.

  • Wactlar, H. (1999). New directions in video information extraction and summarization. In Proceedings of the 10th DELOS workshop (pp. 1–10). Santorini, Greece.

  • Wentland, W., Knopp, J., Silberer, C., & Hartung, M. (2008). Building a multilingual lexical resource for named entity disambiguation, translation and transliteration. In Proceedings of LREC (pp. 3230–3237). Marrakech, Morocco.

  • Yarowsky, D., Ngai, G., & Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the 1st international conference on Human Language Technology research (HLT) (pp. 1–8). Stroudsburg, PA, USA.

Author information

Corresponding author

Correspondence to Ralf Steinberger.

About this article

Cite this article

Steinberger, R., Ombuya, S., Kabadjov, M. et al. Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili. Lang Resources & Evaluation 45, 311–330 (2011). https://doi.org/10.1007/s10579-011-9155-y

  • DOI: https://doi.org/10.1007/s10579-011-9155-y
