1 Introduction

In Information Retrieval (IR) and Natural Language Processing (NLP), a Question Answering (QA) system is a tool that provides an accurate answer to a user's question (Hirschman and Gaizauskas 2001).

Monolingual QA systems usually accept a question expressed in natural language and search for the answer in a collection of documents written in the same language as the question (Voorhees 1999; Webber and Webb 2010). Cross-lingual question answering (CLQA) systems accept questions in one language (the source language) and search for answers in documents written in another language (the target language) (Aceves-Pérez et al. 2007; Peñas et al. 2010; Pérez et al. 2009). CLQA systems give users access to more information than monolingual systems do, but they also introduce additional issues because of the language barrier. In addition, like monolingual QA systems, CLQA systems manage collections whose documents are written in a single language. For CLQA, most systems translate the question or the question keywords into the target language and then apply a monolingual QA approach to find answers in the corpus (Lin et al. 2005).

A fully multilingual QA system manages information in several languages, both at the query level (several languages are available to formulate the query) and at the document level (answers are extracted from a multilingual collection). There are two common approaches (Aceves-Pérez et al. 2008): merging passages and merging answers.

Multilingual QA systems based on merging passages perform multilingual information merging at the passage level and require Cross Language Information Retrieval (CLIR) strategies to retrieve a set of relevant passages or documents. A typical strategy used in implementations of CLIR systems is the so-called “query translation approach”. In this strategy, the user query is translated into each language present in the multilingual collection, and an independent monolingual IR process is run per language (Hull and Grefenstette 1996; Savoy 2004). This approach thus divides the documents according to their language, so the CLIR system works with as many different collections as there are languages. After searching the corpora and obtaining a list of relevant documents per language, the CLIR system must merge these lists to provide a single list of retrieved documents for the user.
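As a rough illustration only, the following Python sketch outlines the query translation approach; `translate` and `monolingual_search` are hypothetical stand-ins for an MT service and a per-language retrieval engine, not components of any particular CLIR system:

```python
def clir_query_translation(query, source_lang, collections, translate, monolingual_search):
    """Query-translation CLIR: translate the query into every collection language,
    run one monolingual retrieval per language, and return the per-language ranked
    lists that a later merging step must combine into a single result list."""
    results_per_language = {}
    for lang, collection in collections.items():
        # Keep the query as-is for its own language; translate it otherwise.
        translated_query = query if lang == source_lang else translate(query, source_lang, lang)
        results_per_language[lang] = monolingual_search(translated_query, collection)
    return results_per_language
```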

Multilingual QA systems based on merging answers manage multilingual information merging at the answer level (Ko et al. 2010). In this case, the strategy is to perform the complete QA process independently for each language and, after that, to integrate the sets of answers into one single ranked list. This approach does not require a CLIR subsystem to work with the multilingual collection but instead manages every collection with a monolingual or bilingual QA system in which the IR process is monolingual. The main problem with this approach is the requirement for a full QA system for each language available in the multilingual collection.

Our goal here is to develop a multilingual QA system based on merging passages, called BRUJA. This system is able to manage a collection of English, French and Spanish documents and accepts queries written in any of those languages. The addition of new languages to BRUJA is easy but requires intensive application of translation resources and CLIR techniques, which could reduce the performance of the system. We investigate when such an approach outperforms its monolingual QA counterpart: is the system performance robust with respect to the language and the type of query? How many questions are answered in languages other than the query language? How many questions are answered in the monolingual scenario but not in the multilingual scenario?

The remainder of this paper is organised as follows: Sect. 2 presents related work. Section 3 revisits the classical architecture of a QA system and describes the BRUJA architecture. Section 4 describes the experimental framework. In Sect. 5, BRUJA results are compared to their monolingual counterparts. Section 6 presents a study of the performance of the proposed system for each supported language (English, French and Spanish); the aim of this study is to test how the system is affected by different query languages, e.g., does BRUJA improve on the monolingual performance for each query language? Section 7 reports experiments on searching for answers in languages other than the query language; these experiments measure the improvement that the system obtains because it manages a multilingual document collection. The last group of experiments, described in Sect. 8, concerns the performance of BRUJA for the different question categories. The last section contains some concluding remarks and proposals for further development.

2 Related work

QA research has been associated with IR since 1995, and several evaluation conferences concern QA, establishing a common framework to evaluate alternative systems. In 1999, the TREC conference presented the TREC QA track (Voorhees 1999) to evaluate open-domain QA systems. Until 2007, the track changed some of its parameters, such as the question types and the lengths of the answers. In 2003, the CLEF conference introduced the pilot task CLEF@QA (Magnini et al. 2004), which evolved from 2003 until 2010. In 2004, a cross-lingual subtask was presented, in which the questions were formulated in a language different from that of the news collection. In 2005, nine source languages and ten target languages were used [eight monolingual and 73 bilingual possibilities (Vallin et al. 2005)], and queries with temporal restrictions were introduced. In 2006, the systems worked with a new type of query, the list query, and all answers were required to be accompanied by a snippet that justified the answer. In 2007, questions were grouped into clusters with a common context. Since 2009, the collections have changed, including a new collection based on Wikipedia and another based on European legislation.

The aim of the Initiative for the Evaluation of XML retrieval (INEX) is to provide means, in the form of a large XML test collection and scoring methods, for the evaluation of different XML retrieval systems. The INEX QA track (QA@INEX) aims to evaluate a complex question-answering task using Wikipedia. This track offers an evaluation framework that combines QA, passage retrieval and automatic summarising by passage extraction, and requires complex questions to be answered by several sentences or by an aggregation of texts from different documents (Bellot et al. 2010).

None of these conferences or workshops works in a truly multilingual mode: although several languages can be used, the systems work with a single-language collection, so it is not necessary to deal with the usual multilingual IR problems. To our knowledge, only the group at the Laboratory of Language Technologies of the Computational Sciences Department at the National Institute of Astrophysics, Optics and Electronics (INAOE) has developed and tested a fully multilingual QA system (Aceves-Pérez et al. 2007, 2008). This group proposed a system based on merging the answers returned by monolingual QA systems. They perform the complete QA process independently for each language, and a module then integrates the sets of answers into one single ranked list. As described previously, this approach does not require a CLIR subsystem to work with the multilingual collection.

3 Architecture overview

3.1 Question answering architecture

In this section, the classical architecture of a QA system is introduced with a brief description of each module. Then, the general architecture of the BRUJA system is described, together with the fusion algorithms used to merge the relevant monolingual result lists.

3.1.1 Classical architecture

A classical architecture of a monolingual QA system includes the following three basic modules (Hovy et al. 2000; Moldovan et al. 2000):

  • Question processing. Question processing includes several tasks, for example, question parsing, question classification and question analysis. Information such as the question type, the expected answer type, the question focus, keywords and entities is extracted.

  • Indexing. The collection is indexed; queries are run against the index, and relevant documents or passages are returned (those with the greatest probability of containing an answer).

  • Answer processing. Candidate answers are identified and ranked and the final answers are returned.
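A minimal sketch of this three-module pipeline is given below; all callables and dictionary keys are illustrative placeholders rather than the interfaces of an actual QA system:

```python
def answer_question(question, index, process_question, retrieve, extract_answers):
    """Classical monolingual QA pipeline: question processing, retrieval, answer processing."""
    # 1. Question processing: question type, expected answer type, focus, keywords, entities.
    analysis = process_question(question)
    # 2. Indexing/retrieval: run the keywords against the index and keep the best passages.
    passages = retrieve(analysis["keywords"], index, top_k=50)
    # 3. Answer processing: identify and rank candidate answers, return the top five.
    candidates = extract_answers(passages, analysis["expected_answer_type"])
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:5]
```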

3.1.2 BRUJA architecture

Figure 1 shows the general architecture of the multilingual QA system BRUJA. The idea of this approach is to retrieve relevant passages from all collections (i.e., from all languages) in parallel, then to integrate these passages into a single ranked list, and finally to extract answers from this combined multilingual set of passages.

Fig. 1 Multilingual architecture of the BRUJA QA system

Based on the classical QA architecture, BRUJA analyses questions and implements an automatic question classifier (QC). In the first module, each non-English question is translated using SINTRAM, and the resulting English questions are classified and analysed to extract named entities (real-world instances that are notable members of subject concept classes). We implemented the automatic QC described in García-Cumbreras et al. (2006), which is based on machine learning and multiple features (lexical, syntactic and semantic) extracted using GATE. This module was trained with a set of 5,500 questions available from USC (Hovy et al. 1999), UIUC and TREC. The dataset was labelled manually by the UIUC group using the following general and detailed categories [proposed in Li and Roth (2002)]:

  • ABBR: abbreviation, expansion.

  • DESC: definition, description, manner, reason.

  • ENTY: animal, body, color, creation, currency, disease/medical, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word.

  • HUM: description, group, individual, title.

  • LOC: city, country, mountain, other, state.

  • NUM: code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight.

Features were identified using the GATE recogniser, and GATE was also used to recognise the named entities in the questions. The IR systems Lemur (document retrieval) and JIRS (passage retrieval) are used to obtain a list of relevant documents or passages, with Okapi as the weighting function (Hancock-Beaulieu and Jones 1998; Robertson and Walker 1999) and with Pseudo-Relevance Feedback (PRF) (Salton and Buckley 1990).
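The classifier itself relies on GATE-extracted lexical, syntactic and semantic features; the following scikit-learn sketch substitutes simple word and bigram features purely to illustrate the supervised set-up, and the toy labels (in the Li and Roth coarse:fine notation) are assumptions about the training format:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_question_classifier(questions, labels):
    """Train a question classifier that predicts a coarse:fine category label."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # word and bigram features
        LinearSVC(),
    )
    clf.fit(questions, labels)
    return clf

# Toy usage (the real classifier was trained on ~5,500 labelled questions):
clf = train_question_classifier(
    ["When did Stalin die?", "Who is Danuta Walesa?", "How many countries are in NATO?"],
    ["NUM:date", "HUM:desc", "NUM:count"],
)
print(clf.predict(["Where is Cairo?"]))  # predicted coarse:fine label
```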

In addition, a multilingual QA system must include two new modules, one for automatic translation and another for merging relevant information. The purpose of the first module is to translate the input question into all target languages and to translate the relevant information into the pivot language to extract the answers, whereas the second module integrates the information obtained from the monolingual IR processes into a single ranked list. Both modules are described in the following sections.

3.2 Translation module

The translation module is designed to improve the overall performance but is not a core issue of the proposed model. In fact, the architecture of BRUJA is largely independent of the translation algorithm because translators are heuristic. Thus, the choice of each translator has been empirical: a translator is chosen when it obtains better results in the retrieval of relevant documents. We attempt to improve the translation from the point of view of IR, where the information unit is usually the word rather than the sentence. Thus, we expand the initial output of the translator by adding non-translated entities or more than one translation per word, using dictionaries or the rest of the available translators. These strategies are common in the CLIR literature (Adriani 2002; Martínez-Santiago et al. 2005). Our own translation module, called SINTRAM (García-Cumbreras 2009), works with several automatic online and free translators. Based on previous IR experiments and results, we have set a default translator for each pair of languages.

The system implements translation strategies to obtain different translations for the same question. These strategies are the following:

  1. S1: The translation returned by the default translator.

  2. S2: Given a phrase, the different translations of each word obtained by applying all of the translators.

  3. S3: S1 plus the original recognised entities (not translated).

  4. S4: S1 plus the recognised entities translated with the default translator, replacing the original entities.

  5. S5: S1 plus the translations of nouns and verbs provided by the electronic dictionary Freedictionary.

  6. S6: S1 plus the most frequent words from the other translators.

Figure 2 shows an example of the translation strategies for a question. Empirical monolingual results for the same question set (Spanish questions as source) show that, in general, the best strategy was the last one (S6).

Fig. 2 Example of the translation module
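A minimal sketch of how these six variants could be assembled is given below; `default_translate`, `other_translators` and `dictionary` are hypothetical stand-ins for SINTRAM's default translator, the remaining online translators and the Freedictionary lookup:

```python
from collections import Counter

def build_translation_strategies(question, entities, nouns_and_verbs,
                                 default_translate, other_translators, dictionary):
    """Build the query-translation variants S1-S6 described above (illustrative only)."""
    # S1: translation produced by the default translator for this language pair.
    s1 = default_translate(question)

    # S2: per-word alternatives collected from every available translator.
    s2 = {word: sorted({t(word) for t in [default_translate] + other_translators})
          for word in question.split()}

    # S3: S1 expanded with the original (untranslated) recognised entities.
    s3 = s1 + " " + " ".join(entities)

    # S4: S1 expanded with the recognised entities translated by the default translator.
    s4 = s1 + " " + " ".join(default_translate(entity) for entity in entities)

    # S5: S1 expanded with dictionary translations of the nouns and verbs.
    s5 = s1 + " " + " ".join(dictionary[w] for w in nouns_and_verbs if w in dictionary)

    # S6: S1 expanded with the words most frequently proposed by the other translators.
    counts = Counter(w for t in other_translators for w in t(question).split())
    s6 = s1 + " " + " ".join(w for w, c in counts.most_common() if c > 1)

    return {"S1": s1, "S2": s2, "S3": s3, "S4": s4, "S5": s5, "S6": s6}
```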

3.3 Information merging

There are several approaches to merging the relevant monolingual lists retrieved by the IR systems: the classical Round-Robin and Raw-Scoring (Callan et al. 1995; Voorhees et al. 1995), Normalised Raw-Scoring (Powell et al. 2000), or methods based on machine learning such as Logistic Regression (Calve and Savoy 2000). In any case, the merging algorithm decreases the precision of the multilingual system (depending on the collection, by between 20 and 40%) (Savoy 2002).

BRUJA implements the classical Round-Robin, Raw-Scoring and 2-step RSV (Martínez-Santiago et al. 2006a) merging algorithms. Methods based on machine learning, such as logistic regression, are not implemented because they require training data that are not available for the framework used. Previous studies in multilingual environments demonstrate that our method, 2-step RSV, obtained the best results (Martínez-Santiago et al. 2006a, b). These three approaches are described as follows:

  • Round-Robin. The documents are interleaved according to the ranking obtained for each document by the monolingual IR processes. Thus, given a multilingual collection covering N languages, the first document of each monolingual retrieval list constitutes the first group of merged documents, the second document of each list constitutes the next group, and so on. The hypothesis here is that relevant documents are distributed homogeneously across the collections. This merging strategy decreases the precision of the results by approximately 40% (Callan et al. 1995; Voorhees et al. 1995).

  • Raw-Scoring. This method produces a final list sorted by the document scores computed independently for each monolingual collection. The method works well when each collection is searched by the same or a similar search engine and the terms of the query are distributed homogeneously over all of the monolingual collections. Heterogeneous term distributions will generate query weights that may vary widely among collections (Dumais 1994), and this phenomenon may therefore invalidate the raw-score merging hypothesis.

  • 2-step RSV. The basic idea of 2-step RSV (Martínez-Santiago et al. 2006b) is straightforward: given a query term and its translations into the other languages, the document frequencies of the term and its translations are grouped together. The method therefore requires the document score to be recalculated, changing the document frequency of each query term: given a query term, the new document frequency is the sum of the monolingual document frequencies of the term and its translations. Re-indexing the whole multilingual collection could be computationally expensive; thus, given a query, only the documents retrieved from each monolingual collection are re-indexed. Perhaps the strongest constraint of this method is that every term in the query must be aligned with its translations, and this information is not always available; for example, machine translation of whole phrases makes word-level alignment of the queries difficult. Because BRUJA uses Machine Translation (MT) resources, a word-level alignment algorithm is required. We use the algorithm depicted in Martínez-Santiago et al. (2006a). Briefly, for each translation, the algorithm works as follows [a more detailed description is available in Martínez-Santiago et al. (2004)]:

    1. Let the original phrase be in English. This phrase is translated into the target language with an MT resource.

    2. Unigrams and bigrams are extracted from the English phrase. Both are translated with the same MT resource used in step 1.

    3. Stopwords (conjunctions, prepositions, articles and other words that appear often in documents but typically carry little meaning) are removed. Non-stopwords are stemmed (stemming reduces inflected words to their stems).

    4. Terms are aligned by matching the words of the translated phrase against the unigram translations. The unigram translations are trivially aligned with their source words, so if a word of the translated phrase matches a unigram translation, we know which source word it corresponds to, and that word is aligned.

    5. After this alignment is finished, if any term of the translated phrase is still not aligned, bigrams with exactly one aligned term are used to align the other term of the bigram.

For the 200 Spanish questions used in the following experiments, the percentage of aligned non-stopword terms was 91%; for the 200 French questions, the algorithm obtained 87%.
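A simplified sketch of the unigram pass of this alignment is given below (the bigram pass of step 5 is omitted); `translate`, `stem` and `stopwords` are hypothetical stand-ins for the MT resource, a stemmer and a stopword list:

```python
def align_unigrams(source_words, translated_phrase, translate, stem, stopwords):
    """Align each source word with a word of the machine-translated phrase by
    matching the (stemmed) unigram translation of the word against the phrase."""
    phrase_stems = {stem(w): w for w in translated_phrase.split() if w not in stopwords}
    alignment = {}
    for word in source_words:
        if word in stopwords:
            continue
        unigram_translation = stem(translate(word))   # unigram translation is trivially aligned
        if unigram_translation in phrase_stems:
            alignment[word] = phrase_stems[unigram_translation]
    return alignment  # words left unaligned are later handled locally, see Eq. (1)
```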

With respect to queries that are only partially aligned, a straightforward and effective way to partially solve this problem is to take non-aligned words into account locally, as terms of a given monolingual collection. Thus, given a document i, a non-aligned term keeps the initial weight calculated in the first step, and the final document score is computed as in (1).

$$ RSV_i^{\prime} = \alpha\cdot RSV_i^{align}+(1-\alpha)\cdot RSV_i^{nonalign} $$
(1)

where RSV is the Retrieval Status Value, \(RSV_i^{align}\) is the score calculated by means of the aligned terms, as the original 2-step RSV method prescribes, and \(RSV_i^{nonalign}\) is calculated locally. Finally, α is a constant (usually fixed to α = 0.75).
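The three merging strategies can be sketched as follows; documents are assumed to be dictionaries with a score field, and the aligned and non-aligned scores of Eq. (1) are assumed to have been computed beforehand:

```python
from itertools import zip_longest

def round_robin_merge(ranked_lists):
    """Interleave the monolingual result lists: the first document of every list,
    then the second of every list, and so on."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged

def raw_score_merge(ranked_lists):
    """Sort the union of the monolingual result lists by their original scores."""
    return sorted((doc for lst in ranked_lists for doc in lst),
                  key=lambda doc: doc["score"], reverse=True)

def two_step_rsv_score(rsv_aligned, rsv_nonaligned, alpha=0.75):
    """Final document score of Eq. (1): combine the score recomputed over the aligned
    terms (global document frequencies) with the locally computed score of the
    non-aligned terms."""
    return alpha * rsv_aligned + (1 - alpha) * rsv_nonaligned
```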

4 Experimental method

In this section, we briefly describe the framework of the experiments.

The collections used in the monolingual and multilingual experiments were provided by the CLEF organisation, and they are summarised in Table 1. The multilingual collection is made up of the union of the three monolingual collections depicted in Table 1, and these collections are the only ones used in both the monolingual and the multilingual experiments. The following information is shown:

  • Collection: language and name of the collection

  • Year: year of the collection

  • Size: in megabytes

  • Docs: total number of documents

  • SizeDoc: average size per document

  • WordDoc: average number of words per document

Table 1 Description of the collection sets

The scenario is set up off-line, as follows:

  1. A set of 200 questions was provided by the CLEF&QA organisation in 2006 (Magnini et al. 2006). These 200 questions were provided in Spanish (SPQ), English (ENQ) and French (FRQ). The original language of the questions is Spanish, and the other sets are manual translations made by CLEF translators. We have made a manual partition of the questions according to the classes of the answers; the goal of this partition is to evaluate the performance of the BRUJA system for each class. The sets of questions by class are the following:

    • Definitional questions about an entity or an acronym. For instance, “What is Atlantis?” or “What does ONU mean?”

    • Definitional questions about people. For instance, “Who is Danuta Walesa?”

    • Factual questions about a location with dates. For instance, “What country did Iraq invade in 1990?”

    • Factual questions about a numeric value. For instance, “How many countries are in NATO?”

    • Factual questions about an organisation. For instance, “What organisation does Yaser Arafat lead?”

    • Temporal questions with entities and dates. For instance, “Name a film in which Kirk Douglas appeared between 1946 and 1960”

    • Factual questions about dates. For instance, “When did Stalin die?”

    • Factual questions about a location, without dates. For instance, “Where is Cairo?”

    • List queries. For instance, “Name countries that extract crude oil.”

    • Other factual questions. For instance, “What is the longest German word?”

  2. The translation module used was SINTRAM (described in Sect. 3.2).

  3. The evaluation of the results was performed in terms of the Mean Reciprocal Rank (MRR) (Voorhees 1999), a statistic used to evaluate a set of possible answers to a query, ordered by their probability of correctness. The reciprocal rank of a query answer is the multiplicative inverse of the rank of the first correct answer, and the MRR is the average of the reciprocal ranks over a sample of queries Q. Accuracy is also used in the evaluation of a QA system, as the proportion of correct answers in the first position (counting only one answer per question) to the total number of answers. The MRR evaluation method is shown in (2)

$$ MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{rank_i} $$
(2)

where Q is the sample set of queries and \(rank_i\) is the rank of the first correct answer for query i. A typical QA system returns five possible answers for each question. The Accuracy value is calculated as shown in (3).

$$ Accuracy = \frac{ncafp}{nra} $$
(3)

where ncafp is the number of correct answers in the first position and nra is the number of returned answers.

We also evaluate the number of correct answers, taking into account the first five answers returned.
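Both measures can be computed directly from the rank of the first correct answer per question, as in the following sketch (a rank of None is assumed to mean that none of the returned answers was correct):

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a query sample: each entry is the 1-based rank of the first
    correct answer for a question, or None when no correct answer was returned."""
    return sum(1.0 / rank for rank in first_correct_ranks if rank) / len(first_correct_ranks)

def accuracy(first_correct_ranks):
    """Proportion of questions whose first returned answer is correct."""
    return sum(1 for rank in first_correct_ranks if rank == 1) / len(first_correct_ranks)

# Toy example with four questions: correct answers at ranks 1, 3 and 2, one question unanswered.
ranks = [1, 3, None, 2]
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
print(accuracy(ranks))              # 1 / 4 = 0.25
```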

The monolingual experiments performed are the following:

  • MONO_ES_ES: the set of monolingual Spanish questions against the Spanish collection.

  • MONO_EN_EN: the set of monolingual English questions against the English collection.

  • MONO_FR_FR: the set of monolingual French questions against the French collection.

To test the performance of the translation module, we ran some bilingual experiments, using questions in different languages against the English collection.

These bilingual experiments are the following:

  • BI_ES_EN: the set of Spanish questions against the English collection.

  • BI_FR_EN: the set of French questions against the English collection.

Finally, the multilingual experiments are the following:

  • MULTI_ES_RR: the set of Spanish questions, and its translation into English and French, against all of the collections (Spanish, English and French). The fusion method used was Round-Robin.

  • MULTI_ES_RS: the set of Spanish questions, and its translation into English and French, against all of the collections (Spanish, English and French). The fusion method used was Raw-Scoring.

  • MULTI_ES_2STEP: the set of Spanish questions, and its translation into English and French, against all of the collections (Spanish, English and French). The fusion method used was 2-step RSV.

  • MULTI_EN_2STEP: the set of English questions, and its translation into Spanish and French, against all of the collections (Spanish, English and French). The fusion method used was 2-step RSV.

  • MULTI_FR_2STEP: the set of French questions, and its translation into Spanish and English, against all of the collections (Spanish, English and French). The fusion method used was 2-step RSV.

The evaluation of the results was made manually by three people, who checked whether any of the five answers returned per question was correct and identified the position of the correct answer.

5 Overall evaluation of BRUJA

In this section, we present the general results obtained with the BRUJA QA system. The goal is to examine the general behaviour of the system when comparing monolingual and multilingual experiments and to examine the influence of some of the BRUJA modules. Our aim is not to compare our results with those presented at the CLEF&QA 2006 task, because those systems and results are not multilingual; instead, we want to compare our system when it works with monolingual versus multilingual collections. In any case, Table 2 shows the best results obtained for the monolingual Spanish task at CLEF&QA 2006, and the first row shows our monolingual result (Magnini et al. 2006).

Table 2 Summary of the Spanish monolingual results at CLEF&QA 2006

5.1 Monolingual versus multilingual performance

The goal in this section is to evaluate whether the multilingual QA system developed introduces more correct answers and whether those answers appear in the first position of the retrieved list, compared to results with the monolingual systems. We also evaluate the influence of some of the BRUJA modules.

Table 3 shows the global mono and multilingual results with Spanish as the source language. The second column presents the number of correct answers for the 200 questions, in any of the first five positions. The last column is the percentage of improvement in the MRR value, with the monolingual Spanish experiment as the baseline case (MONO_ES_ES). Answers come from different documents and, in the multilingual cases, from different collections.

Table 3 Summary of mono and multilingual results (Spanish as source language)

A first comparison can be established between the monolingual system (MONO_ES_ES), as the baseline case, and the multilingual systems that use the fusion methods Round-Robin (MULTI_ES_RR) and Raw-Scoring (MULTI_ES_RS). Round-Robin did not improve on the baseline case, and Raw-Scoring obtains an MRR value that is 25.6% lower than the baseline. There are no significant differences in the numbers of correct answers, but the positions of the correct answers are worse, so the MRR values decrease. A good QA system not only obtains more correct answers but also returns them in the first positions of the retrieved lists.

The main comparison is established between the monolingual baseline case and the multilingual case that uses the 2-step RSV fusion method (MULTI_ES_2STEP), the fusion method implemented in the BRUJA QA system. The multilingual experiments disclose the importance of the merging algorithm for managing a multilingual collection. The multilingual BRUJA system outperforms the monolingual baseline case, improving on the Spanish monolingual case by 15% in terms of MRR. Some reasons for this improvement are the following:

  • BRUJA finds more correct answers. The multilingual case finds 77 correct answers, and these answers come from all of the collections. Some of the correct answers also appear in the monolingual experiment, but the multilingual system finds new answers in other languages.

  • BRUJA finds correct answers at the first attempt. Most of the correct answers are in the first position, so the MRR and Accuracy values are similar. The main reason for this pattern is that the correct answers appear more often in the collections (in all of the languages) and, as a result, their final score is increased.

  • The noise introduced is minimal. Working with several collections from different languages could introduce noise into the system because of the translation process, but the multilingual system does not introduce many mistaken answers.

To compare the baseline case (MONO_ES_ES) to the best multilingual case and to the fusion methods that were applied, we ran the Wilcoxon test (Wilcoxon 1945). This test was developed to analyse data from studies with repeated-measures and matched-subjects designs, and it is used to determine the differences between groups of paired data when the data do not meet the assumptions of a parametric test. The following two null hypotheses were tested (a sketch of the paired comparison follows the list):

  • The first test was to compare whether the results of two experiments (in terms of the Reciprocal Rank, or rr, of the 200 questions) were similar

  • The second test was to compare whether the result of one experiment was better than that of the other experiment
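A minimal sketch of these two paired tests with SciPy, assuming the per-question reciprocal ranks of both runs are available as equal-length lists (a reciprocal rank of 0 meaning that no correct answer was returned):

```python
from scipy.stats import wilcoxon

def compare_runs(rr_baseline, rr_system):
    """Paired Wilcoxon signed-rank tests on the per-question reciprocal ranks:
    a two-sided test for 'the runs differ' and a one-sided test for
    'the second run is better than the baseline'."""
    _, p_different = wilcoxon(rr_baseline, rr_system, alternative="two-sided")
    _, p_better = wilcoxon(rr_system, rr_baseline, alternative="greater")
    return p_different, p_better
```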

Table 4 presents the p values obtained (third and fourth columns). Values lower than 0.05 indicate statistical significance; the p values are reported with four decimal places. Based on these p values, we conclude that the multilingual experiment MULTI_ES_2STEP is different from and better than the monolingual baseline case (MONO_ES_ES) and the other multilingual experiments (MULTI_ES_RS, MULTI_EN_2STEP and MULTI_FR_2STEP).

Table 4 Wilcoxon test results over monolingual versus multilingual experiments

6 Performance for each language

An interesting experiment was to run the monolingual and multilingual systems, the latter with the fusion method 2-step RSV, on the set of 200 questions in each language. The aim was to test how the system is affected by a different source language (Spanish, English or French).

Table 5 shows the results obtained with each monolingual and multilingual experiment (with the fusion method 2-step RSV), changing the source language of the set of questions. The last column presents the difference (in percentage) between the MRR of each monolingual experiment and that of its corresponding multilingual experiment with the same source language. As previously described, the original language of the questions is Spanish, and the other sets are manual translations made by CLEF translators.

Table 5 Summary of monolingual and multilingual results with 2-step RSV, changing the source language of the questions

A first analysis of the results discloses that the best monolingual experiment was the Spanish one (MONO_ES_ES), in terms of correct answers, MRR and Accuracy. The French and English monolingual experiments obtain results with a decrease in MRR of 11.6%. The French monolingual case obtains fewer correct answers because it is the most complex case: French is neither the original language of the questions nor the pivot language of the BRUJA system.

BRUJA obtains high performance with every source language and, with the fusion method 2-step RSV, it improves both on the monolingual results (by between 12 and 31%) and on the other multilingual runs that use other fusion methods. The Spanish question set obtains the best performance, which is expected because Spanish is the original language of the 200 questions. With the English questions, the MRR decreases by 16.76%; in terms of correct answers, the values obtained are similar, but the correct answers appear at lower positions.

This behaviour makes BRUJA a stable and robust system across the possible source languages. All multilingual experiments with 2-step RSV improve on the monolingual results.

To compare each monolingual experiment with the multilingual experiment that uses the fusion method 2-step RSV and the same source language, the Wilcoxon test was run. Table 6 presents the p values obtained. We conclude that the Spanish and English multilingual experiments are different from, and always better than, the monolingual ones. French is neither the original language of the questions nor the pivot language of the BRUJA system, so the improvement of the French results is not statistically significant.

Table 6 Wilcoxon test results over monolingual versus multilingual experiments with the same source language

7 Searching for answers in languages other than the query language

After the analysis of the previous results, we extracted some statistics on the correct and incorrect answers, comparing the monolingual and multilingual experiments that use the same source language. The goal was to compare the number of new answers that the multilingual system introduces with the number of lost answers, i.e., answers extracted by the monolingual system that the multilingual system did not find. Table 7 shows this comparison, with the same source language. All multilingual cases use the fusion method 2-Step RSV. The most interesting result is that the number of correct answers obtained exclusively by the multilingual system is higher than the number of correct answers obtained exclusively by the monolingual system. For the Spanish query set, the multilingual system MULTI_ES_2STEP obtains 27 correct answers that its monolingual counterpart (MONO_ES_ES) does not, whereas MONO_ES_ES obtains only 18 correct answers for which the multilingual system fails. This proportion is similar for English (28/16) and French (27/11). In short, the number of new correct answers extracted by the multilingual system is roughly double the number of lost answers.

Table 7 Comparison of correct and incorrect answers between mono and multilingual systems, with the same question set as input

Further analysis was performed to demonstrate that BRUJA finds correct answers in collections other than that of the source language, and that it does so with each language as the source of the questions posed. To accomplish this analysis, we ran two experiments, as follows:

  1. Working with English questions, how many correct answers were not found in the English collection? Because English is the pivot language of the BRUJA system, one might expect that BRUJA would not find answers in French and Spanish because of the noise of the translation process.

  2. Working with the French collection, how many correct answers were found, for any question language? French is the most difficult language for BRUJA because it is neither the pivot language of the system nor the original language of the 200 questions, so it appears to be the worst case for the BRUJA system.

It is important to point out that both experiments count exclusive answers: answers returned only from the French and Spanish collections in the first case, and only from the French collection in the second case. These answers appear only in collections other than the English one; in general, BRUJA finds the same answers in many documents spread over the three collections used.

Table 8 presents these statistics, based on the experiment MULTI_EN_2STEP. The last column shows the percentage of correct answers per language, or per group of languages, with respect to the total number of correct answers. More than half of the correct answers were extracted from the English collection (which does not mean that they did not also appear in the other languages). The main conclusion is that 35% of the correct answers were not extracted from English documents; these correct answers cannot be obtained by a monolingual or bilingual QA system working with the English collection.

Table 8 Correct answers per language, with the multilingual experiment MULTI_EN_2STEP

Table 9 shows, for each multilingual experiment with the fusion method 2-Step RSV, the number of correct answers obtained from the French collection. The last column shows the percentage of correct answers per language, with respect to the total number of correct answers. The results show that, with English as the source language, more than 15% of the correct answers are extracted exclusively from the French collection. Remember that the original queries were designed for the Spanish collection (more answers in Spanish) and that BRUJA uses English for the usual QA tasks such as query classification and answer extraction; thus, French is the hardest language for BRUJA. Even so, BRUJA obtains a remarkable 15.87% of its correct answers exclusively from the French document collection, which means that those answers are found neither in the English nor in the Spanish document collections. In short, if the French collection were eliminated, BRUJA would find 15.87% fewer correct answers.

Table 9 Correct answers from the French collection, with the multilingual experiments that use the fusion method 2-Step RSV

8 Evaluating BRUJA by query category

To check the detailed performance of the BRUJA system, the source set of questions has been divided manually into question categories, with the following classes:

  1. Factual questions (Fac)

  2. Definitional questions (Def)

  3. Other questions (Oth)

Likewise, the questions were classified into the following detailed categories, described in Sect. 4 above:

  • Definitional questions about an entity or an acronym (DEF_ENT_ACR).

  • Definitional questions about people (DEF_PERS).

  • Factual questions about a location with dates (FACT_LOC_FEC).

  • Factual questions about a numeric value (FAC_NUM).

  • Factual questions about an organization (FAC_ORG).

  • Temporal questions with entities and dates (TEMP_ENT_FEC).

  • Factual questions about dates (FAC_FEC).

  • Factual questions about a location, without dates (FAC_LOC).

  • List queries (LIST).

  • Other factual questions (FAC_OTR).

Table 10 shows the results obtained with the monolingual experiment MONO_ES_ES (Spanish questions set), using the general and detailed groups of questions.

Table 10 MONO_ES_ES: results per general and detailed categories of questions

We emphasize that:

  • Most of the questions are factual, 117 out of a total of 200 (58.5%). In the detailed classification, most of these questions ask about numeric values (FAC_NUM), dates (FAC_FEC) and locations (FAC_LOC). Factual questions obtain a global MRR of 0.324, and only the questions that ask about organisations obtain poor results in the detailed classification. There is little difference between MRR and Accuracy because of the positions of the correct answers.

  • There are 67 definitional questions (33.5%), most of them asking about people (DEF_PERS) or about entities and acronyms (DEF_ENT_ACR). These questions obtain a global MRR of 0.305, with similar results across the detailed categories. There is almost no difference between MRR and Accuracy (most of the correct answers are in the first or second position).

  • Other results are not comparable because of the low number of questions; the detailed groups with fewer than 10 questions are not significant in the analysis of the results.

Table 11 presents the results obtained with the multilingual experiment MULTI_ES_2STEP (Spanish questions set and our fusion method), using the general and detailed groups of questions.

Table 11 MULTI_ES_2STEP: results per general and detailed categories of questions

The analysis of these results, and their comparison with the previous monolingual experiment, shows that the improvement of the system applies across all categories of questions (factual and definitional, general and detailed). MRR and Accuracy values are similar in all cases; thus, most of the correct answers are in the first positions of the retrieved lists. Another interesting result is that for some questions the BRUJA system returns only one answer, and it is the correct one. Within the detailed categories, the definitional questions increase their values, and the factual questions obtain similar values.

9 Conclusions and further work

In this study, we report a large number of experiments to evaluate BRUJA, a multilingual QA system based on merging passages. Previous work (Aceves-Pérez et al. 2008) showed that multilingual QA based on merging answers is slightly better than QA based on merging passages, but that neither approach outperforms its monolingual QA counterpart. The first conclusion is that a good information merging algorithm is required to overcome monolingual QA results; more concretely, our 2-step RSV merging algorithm is very suitable for multilingual QA based on merging passages.

In addition, we have demonstrated the following:

  • The improvement achieved by the BRUJA system is query language independent. We obtain a consistent improvement with respect to the monolingual QA counterpart when the query language is English (19.3%), French (27%) and Spanish (15.65%). In addition, the overall accuracy of the system is similar for all query languages (accuracy between 0.37 and 0.38).

  • BRUJA uses English as the pivot language, so collections in languages other than English are reached only through translation; even so, BRUJA obtains a remarkable percentage of correct answers in languages other than English, even when the query language is English (34.91%).

  • Because French is neither the pivot language nor the original language of the questions, the French documents are the most difficult collection from which to extract correct answers. Even so, an average of 10.3% of the correct answers is obtained exclusively from the French collection; the other correct answers were obtained from more than one collection (French and English, French and Spanish, or French, English and Spanish).

  • The multilingualism of BRUJA works well across all query types.

An interesting issue regarding multilingual QA systems is knowing when such systems outperform a monolingual QA system. Multilingualism introduces additional processes, such as translation and the fusion of answers. These processes are not perfect, and some noise is introduced; the question is when the potential new answers in other languages compensate for such noise. We have tried to give an answer, at least partially, since BRUJA has been evaluated using a particular multilingual collection, the CLEF collection. This collection is, to some extent, a multilingual comparable corpus. In other words, given a question, documents containing correct answers are not fully balanced across the monolingual collections; their availability varies from one language to another. Note that this is a very usual scenario, but it is not the only one possible. If the collections are not comparable, that is, if they cover very different topics, it is clear that a monolingual system will only find answers for some topics. At the opposite end, there are very similar collections, such as parallel corpora, in which there is no supplementary information in the collections of the other languages. For such collections, does a multilingual method work? It is quite possible that a monolingual system would outperform the multilingual one because of the noise introduced by the translations to the interlingua and by the CLIR module. In any case, we think that it depends on the architecture of the multilingual system and the resources that such a system manages for each language. For example, if the multilingual system is made up of a monolingual QA system per language and the resources available for each language are very heterogeneous, then it is possible that, even given a parallel corpus, such a system could obtain new answers in other languages because of the different skills of each monolingual system. However, this is not the case for the proposed architecture, since BRUJA uses an interlingua-based approach rather than a monolingual QA system for each language. We conclude that a BRUJA-like multilingual system is probably better than a monolingual system whenever the multilingual corpus is made up of monolingual corpora with heterogeneous information, even when such corpora are somewhat comparable, as the CLEF corpora are.

For further developments, we want to investigate the following topics:

  • Working with more languages. It is not very complex to add new languages to the system; the only requirement is the availability of a state-of-the-art translator into English. Note that a simplistic machine translator could introduce more incorrect answers than correct answers.

  • Searching for answers in an off-line step. Some current QA systems preprocess a collection to obtain factual and definitional answers of a certain type or pattern, storing them in a database. It would be possible to perform this process with BRUJA on the English documents, but we do not have an answer extraction module for each language.

  • Improving BRUJA modules. Each module developed in BRUJA can be improved with more and better resources.

  • Dealing with different types of questions. New modules could be developed to manage temporal questions, combined list questions, or questions that require knowledge.