Multilingual event extraction for epidemic detection

https://doi.org/10.1016/j.artmed.2015.06.005

Highlights

  • We present DANIEL, a multilingual system for tele-epidemiology.

  • Classical approaches use language-dependent resources, which limits coverage.

  • Our system uses very few resources and performs consistently on any language.

  • It therefore has worldwide coverage and better reactivity than the state of the art.

  • Readily dealing with any language implies faster epidemic detection.

Abstract

Objective

This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has been shown that multilingual approaches improve timeliness in detection of epidemic events across the globe, eliminating the wait for local news to be translated into major languages. We present here a system to extract epidemic events in potentially any language, provided a Wikipedia seed for common disease names exists.

Methods

The Daniel system presented herein relies on properties that are common to news writing (the journalistic genre), the most useful being repetition and saliency. Wikipedia is used to screen common disease names, which are then matched against repeated character strings. Language variations, such as declensions, are handled by processing text at the character level rather than at the word level. This also makes it possible to handle various writing systems in a similar fashion.
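The character-level idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the longest-common-substring criterion and the `min_ratio` threshold of 0.75 are all assumptions made for illustration. The point is only that a declined form such as Polish "grypy" still shares most of its characters with the seed name "grypa", so no lemmatizer is needed.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of consecutive characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def matches_disease(name: str, text: str, min_ratio: float = 0.75) -> bool:
    """Hypothetical rule: the text matches if it contains a character string
    covering at least min_ratio of the seed disease name (assumed threshold)."""
    name, text = name.lower(), text.lower()
    if not name:
        return False
    shared = longest_common_substring(name, text)
    return len(shared) / len(name) >= min_ratio


# The seed name "grypa" (Polish for flu) matches the declined form "grypy".
print(matches_disease("grypa", "Epidemia grypy w Polsce"))
# The same function works unchanged on Cyrillic text.
print(matches_disease("грипп", "Вспышка гриппа в Москве"))
```

Because the comparison runs over characters rather than tokens, the same code applies to scripts that do not separate words with spaces; the quadratic-time comparison used here is chosen for readability, not speed.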

Material

As no multilingual ground truth existed to evaluate the Daniel system, we built a multilingual corpus from the Web and collected annotations from native speakers of Chinese, English, Greek, Polish and Russian, who had no connection to or interest in the Daniel system. This data set is freely available online and can be used to evaluate other event extraction systems.

Results

Experiments for 5 languages out of 17 tested are detailed in this paper: Chinese, English, Greek, Polish and Russian. The Daniel system achieves an average F-measure of 82% in these 5 languages. It reaches 87% on BEcorpus, the state-of-the-art corpus in English, slightly below top-performing systems, which are tailored with numerous language-specific resources. The consistent performance of Daniel on multiple languages is an important contribution to the reactivity and the coverage of epidemiological event detection systems.
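For readers unfamiliar with the metric, the sketch below shows how precision, recall and a balanced F-measure are commonly computed for document filtering. It assumes the F-measure reported here is the balanced F1 score, which the abstract does not state explicitly; the function name and the toy document identifiers are hypothetical.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Compare the documents a system flags as relevant (predicted)
    with those the annotators marked as relevant (gold)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 documents flagged, 5 truly relevant, 3 in common.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```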

Conclusions

Most event extraction systems rely on extensive language-specific resources. While their sophistication yields excellent results (over 90% precision and recall), it restricts their coverage in terms of languages and geographic areas. In contrast, in order to detect epidemic events in any language, the Daniel system only requires a list of a few hundred disease names and locations, which can be acquired automatically. The system can perform consistently well on any language, with precision and recall around 82% on average, according to this paper's evaluation. Daniel's character-based approach is especially interesting for morphologically rich and low-resourced languages. Because it loads few resources and relies on state-of-the-art string-matching algorithms, Daniel can process thousands of documents per minute on a simple laptop. In the context of epidemic surveillance, reactivity and geographic coverage are of primary importance, since no one knows where the next event will strike, and therefore in what vernacular language it will first be reported. By being able to process any language, the Daniel system offers unique coverage for poorly endowed languages, and can complement state-of-the-art techniques for major languages.

Introduction

Information extraction (IE) aims at extracting structured views from text, particularly from newswires, which provide instant information from a large number of sources. The European Media Monitor, for instance, collects about 40,000 news reports written in 43 languages every day. This information context provides a new opportunity for health authorities that need to monitor information, with an emphasis on disease outbreaks and their spread [1].

However, natural language processing has historically put a very strong emphasis on vocabulary and on differences between languages, to the extent that computational models rely heavily on the construction of lexical resources. Special effort has been devoted to collecting specialized medical lexica. Therefore, although web news is available in a large number of languages and dialects, the standard IE pipeline is designed for texts in standard English, and a lexicon and special components (lemmatizer, parser) must be added each time a new language is covered. Meanwhile, disease outbreaks ignore national frontiers, and epidemiological event extraction (EE) has to detect diseases from health-related news in many languages in order to send alerts to health authorities as quickly as possible [2].

Keller has compared existing systems [3], stressing their complementarity. In the same way, the Data Analysis for Information Extraction in any Language (Daniel) system fulfills part of the needs but not all. The strong points advocated here are quick access to new languages, the very light programming needed and timeliness in IE [4]. It is also important for epidemiological EE to be balanced across languages, so that events are detected from multilingual sources both at the same pace and with similar reliability. Since no multilingual corpus was available for comparison with existing systems, a news corpus has been collected and made available for further tests.

The Daniel system is a text-genre based EE system designed to manage multilingual news with a large geographical coverage. Multilingual IE with light resources was tested in order to quickly detect news denoting concern about some disease. Here, the standard approach to text as a bag of words is replaced by a spatial vision of text. Three characteristics are combined to avoid the chore of building heavy resources for every language. The first rests on a strong hypothesis: the constraints of information and dissemination are common to all news writers, and the journalistic genre implies a common use of titles, headers, bodies and feet, whatever the language. This common structure in news is the rhetorical “spatial” basis for the proposed model: information is found at a specific place. A similar notion is sometimes used in academic literature analysis [5]. The second characteristic is the implicit use of discourse “time”, a.k.a. the narrative line in news, with some typical repetitions along the way. The third characteristic is the use of the news date, linking the event to a given time window in conjunction with a geographical location and a disease.

Since the system fills a gap in epidemiological monitoring, experiments were conducted on a multilingual corpus of 17 languages, which was manually annotated for 5 of them (Chinese, English, Greek, Polish and Russian). Precision and recall are computed for document-wise and event-wise detection. The question is how to evaluate a light-resource system aiming at wide coverage when most efforts go into enriching resources and improving results for a very small number of languages. Whenever possible, results are compared with existing systems, or on common corpora.

The present paper is organized as follows. In Section 2, an overview of the multilingual approaches in IE is provided along with proposals to overcome shortcomings in early detection of diseases. In Section 3, we introduce the Daniel system, a text-genre based EE system designed to manage multilingual news. Section 4 introduces the evaluation corpus that we collected for the experiments. In Section 5 the results are presented and discussed. Finally, the efficiency of such a light approach for filtering huge multilingual news feeds is discussed and future directions are sketched in Section 6.

Section snippets

Background

IE approaches rely mostly on the use of the generic IE chain [6]. Two systems that rely primarily on English, Puls [7] and Biocaster [8], are well-known examples of classic IE systems specializing in epidemiological EE with good results in English and a few other languages. HealthMap [9] is

The Daniel system

The Daniel system is an implementation of a discourse-level EE approach. It operates at the discourse level by exploiting the global structure of news in a newswire. It harnesses information ordering as defined by Lucas [14], as opposed to the usual analysis at the sentence level (morphology, syntax and semantics). Inputs to the system are news texts, including their title and text body, and the name of the source when available. The only structural information needed are the positions of the

Corpus

To the best of our knowledge, there is no available corpus for the evaluation of multilingual epidemic surveillance. The only corpus available online, BEcorpus, is exclusively built with relevant documents (200), making it unsuitable for evaluating the precision of document filtering. The corpus consists of a list of uniform resource locators (URLs) of Web pages compiled in 2009, and of which 102 source documents were still available in

Results and evaluation

This section shows the performance of the repetition rule in salient zones to select relevant press articles. Daniel is first demonstrated through examples, then evaluated quantitatively against annotators’ judgements on the evaluation corpus. The system processes 2000 documents in less than 15 s, which is compatible with online surveillance.
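To make the notion of a repetition rule in salient zones concrete, here is a minimal sketch under stated assumptions. The excerpt does not specify the exact zones or the rule, so this example assumes that the salient zone is the title plus the first paragraph, that paragraphs are separated by blank lines, and that a disease string counts only if it occurs both in that zone and again in the rest of the text. The names `split_zones` and `repeated_in_salient_zone` are hypothetical.

```python
def split_zones(title: str, body: str) -> tuple:
    """Split an article into an assumed salient zone (title + first
    paragraph) and the remainder of the text, both lowercased."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    head = " ".join([title] + paragraphs[:1])
    rest = " ".join(paragraphs[1:])
    return head.lower(), rest.lower()


def repeated_in_salient_zone(title: str, body: str, disease_strings: list) -> list:
    """Return the disease strings found in the salient zone AND repeated
    elsewhere; an empty list means the article is filtered out."""
    head, rest = split_zones(title, body)
    return [s for s in disease_strings
            if s.lower() in head and s.lower() in rest]


article = (
    "Officials confirmed new cases on Monday.\n\n"
    "The cholera epidemic has spread to two more districts.\n\n"
    "Football results were also announced."
)
print(repeated_in_salient_zone("Cholera outbreak in Haiti", article,
                               ["cholera", "measles"]))
```

A passing mention of a disease deep in an unrelated article does not trigger the rule, because the string never appears in the salient zone. Substring tests are used rather than token comparison, in keeping with the character-level processing described for the system.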

Objective

The challenge in health surveillance is to ensure world coverage. The current approach is to multiply dedicated systems for each language, but resources are lacking for a very large number of them. The richest state-of-the-art system handles 10 languages, whereas there are about 6000 languages in the world, 300 of which are spoken by more than one million people. The principles of a genre-based IE system called Daniel have been tested on 17 languages and evaluated on 5 languages: Chinese,

References (28)

  • S. Doan et al.
    Global health monitor – a web-based system for detecting and mapping infectious diseases

  • R. Steinberger
    A survey of methods to ease the development of highly multilingual text mining applications
    Lang Resour Eval (2011)

  • M. Keller et al.
    Use of unstructured event-based reports for global infectious disease surveillance
    Emerg Infect Dis (2009)

  • G. Lejeune et al.
    Added-value of automatic multilingual text analysis for epidemic surveillance

  • B. Webber et al.
    Discourse structure and computation: past, present and future

  • J. Hobbs
    Generic information extraction system

  • M. Du et al.
    Building support tools for Russian-language information extraction

  • N. Collier
    Towards cross-lingual alerting for bursty epidemic events
    J Biomed Semant (2011)

  • C.C. Freifeld et al.
    HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports
    J Am Med Inform Assoc (2008)

  • O. Etzioni et al.
    Open information extraction: the second generation

  • R. Munro
    Processing short message communications in low-resource languages
    (2012)

  • R. Steinberger et al.
    Multilingual media monitoring and text analysis – challenges for highly inflected languages

  • F.-J. Tsai et al.
    Is the reporting timeliness gap for avian flu and H1N1 outbreaks in global health surveillance systems associated with country transparency?
    Glob Health (2013)

  • N. Lucas
    Stylistic devices in the news, as related to topic recognition
