1 Introduction

Our research vision is to build an infrastructure in which state-of-the-art question answering (QA) core components can be easily integrated, run, and evaluated. This vision is motivated by the fact that the QA community has so far released a considerable body of research as well as valuable running components accomplishing various QA tasks. To achieve this vision, we recently published foundations [3, 22, 25] that are essential for solving the interoperability, integration, and reusability issues. Initially, we published the qa vocabulary [22] as a flexible and extensible data model for annotating the outputs of QA components. Thereafter, we proposed Qanary [3], a methodology for integrating components of QA systems which (i) utilises the qa vocabulary for annotation, (ii) is independent of programming languages, (iii) is agnostic to domains and datasets, and (iv) integrates components on different granularity levels. Most recently, we published Frankenstein [25], which is concerned with (1) a prediction mechanism for estimating the performance of a component given a question and a required task, and (2) an approach for composing performance-optimised pipelinesFootnote 1 by integrating the most accurate components for the current QA tasks (i.e. the user’s question). Frankenstein uses the Qanary methodology to integrate state-of-the-art QA components within its architecture.

However, in [25] we disregarded the implementation details, reusability, configuration details, and integration advantages of Frankenstein, whereas this paper introduces Frankenstein as an application/platform addressing (a) how to build a new QA pipeline using the 29 integrated components, (b) how each component can be reused independently, and (c) how to evaluate questions/texts. Hence, we disassemble the Frankenstein implementation to present a large set of reusable components from the QA community which can be run, evaluated, and compared using the additional tools Frankenstein offers. In other words, by decoupling the Frankenstein architecture, the overall architecture becomes a collection of 29 components as reusable resources, which can be used either to build QA pipelines or for text analysis. We introduce the major modules of Frankenstein, which not only enable detecting optimal pipelines but also allow us to easily run, evaluate, and compare any configured QA system. Frankenstein integrates the 29 QA components by providing an individual wrapper for each component; thus, end users do not need to get involved in the configuration and implementation details of the components. In fact, these components can be directly reused to build QA systems. Consequently, just by using the QA components described in this paper, 380 reasonable QA pipelines can be created with little effort. Hence, many new insights w.r.t. the performance of QA might be derived using these components and pipelines, which also provide support for analytics as well as for adopting additional components.

The contribution of this paper is the release of the Frankenstein modules, comprising two kinds of open-source resources: (i) reusable components and (ii) component-wise runners and evaluators. These resources are briefly described in the following:

  • Reusable QA Components: We collected 29 QA components accomplishing various QA tasks, i.e. Named Entity Identification/Recognition (NER), Named Entity Disambiguation (NED), Relation Linking (RL), Class Linking (CL), and Query Building (QB). We then implemented a wrapper for every included component, enabling these popular tools to be easily integrated and reused within the Frankenstein framework. Therefore, these components can be used for building modular question answering systems which might analyse text, provide knowledge extraction, etc. Furthermore, each wrapper annotates the output of its component using the qa vocabulary to provide machine readability and homogeneity among the outputs of all components.

  • Evaluators for Components and Benchmarks: We have automatised the process of running and evaluating any component integrated within Frankenstein. This enables evaluating and comparing QA components for the individual stages of a QA pipeline. Consequently, it is possible to analyse the performance of each QA component as well as of whole QA systems, which leads to completely new insights on the performance of particular QA tasks. Hence, researchers can easily uncover quality flaws and improve performance while aiming at existing or novel fields of applicability. The evaluator components are independent of the input benchmark and can be configured in a few easy steps based on the requirements of the user.

This work is substantially impactful for the QA and NLP communities because (1) it facilitates the comparison of NER, NED, RL, CL, and QB components w.r.t. any given gold standard, and (2) new and upcoming components can easily be integrated at any stage of the QA pipeline, ensuring scalability. Thus, through this platform, the research community is empowered with an automatic approach that easily reuses the core components and facilitates running and comparing the performance of components over any given benchmark.

The rest of this paper is organised as follows. Section 2 presents the importance and impact of this work for the research community. Section 3 lists all components integrated so far along with their characteristics. The major modules of Frankenstein are presented in Sect. 4. Section 5 presents our plan for the availability and sustainability of the resources. Section 6 reviews the state of the art, and we close with conclusions and future work in Sect. 7.

2 Broader Impact

Impact on QA Community. Recently, the QA community has been supported by modular approaches such as openQA [14], Qanary [3, 23], OKBQA [12], QALL-ME [7], and Frankenstein [25], all aiming at integrating and reusing existing QA core components. Frankenstein, built on top of Qanary, addresses the limitations observed in the prior approaches. For example, openQA expects a Java implementation of the components, which is not available in most cases. Also, openQA and QALL-ME are difficult to configure, and their components are not directly reusable in other approaches. More importantly, these frameworks do not support a dynamic pipeline methodology. Moreover, the distinguishing features united within Frankenstein make it scalable, user-friendly, and fully automatic, which is rare among the prior approaches. Apart from these general characteristics, the Frankenstein resources relieve researchers of the need to develop a full QA pipeline. In fact, researchers can focus on improving individual stages of the QA pipeline while reusing existing components for the other QA tasks to complete their pipeline.

For example, recent work on a query builder component [33] reused the results of Frankenstein components for building and evaluating a QA pipeline in its empirical study. In this way, QA researchers can focus on independent stages to make them more accurate and intelligent. Furthermore, the automated evaluation process within Frankenstein assists researchers in easily integrating their newly developed components and evaluating their performance against the state-of-the-art components over any given benchmark.

Impact Beyond QA Community. Although the primary contribution of our work targets the QA community, other disciplines – particularly the information extraction (IE) and natural language processing (NLP) communities – benefit from Frankenstein because of common tasks such as NED, RL, and CL. For example, 11 NER components and 9 NED components are integrated into Frankenstein and coupled with the corresponding runner and evaluator tools. These components are also utilised in information retrieval and social media analytics for entity recognition and disambiguation on large textual corpora or tweet corpora. Any given benchmark can be uploaded, and thus a domain-specific evaluation of performance can be derived and published. Enabling these communities to reuse the existing components opens new perspectives for future steps. For example, there is as yet no meticulous study of the performance details, i.e. where each component performs well and what its pitfalls are.

3 Reusable Components and Characteristics

Named Entity Recognition and Disambiguation Components. The aim of the named entity recognition (NER) task is to recognise the entities present in a question, and the aim of named entity disambiguation (NED) is to link these spotted entities to their mentions in a knowledge base (e.g. DBpedia [1]). For instance, in the example question “Who is the mayor of Berlin?”, an ideal component performing the NER task recognises Berlin as an entity, and a component for the NED task links it to its DBpedia mention dbr:BerlinFootnote 2. The following NER and NED components are now available as reusable resources within Frankenstein.

  1. Entity Classifier uses rule-based grammar to extract entities in a text [5]. Its REST endpoint is available for wider use for the NER task.

  2. Stanford NLP Tool: The Stanford named entity recogniser is an open-source tool that uses Gibbs sampling for information extraction to spot entities in a text [8].

  3. Babelfy is a multilingual, graph-based approach that uses random walks and the densest-subgraph algorithm to identify and disambiguate entities present in a text [16]. We have used the public APIFootnote 3 of Babelfy for the NER and NED tasks as separate components.

  4. AGDISTIS is a graph-based disambiguation tool that couples the HITS algorithm with label expansion strategies and string similarity measures to disambiguate entities in a given text [30]. The code is publicly availableFootnote 4.

  5. DBpedia Spotlight is a web serviceFootnote 5 that uses a vector-space representation of entities and cosine similarity to recognise and disambiguate entities [15]; an example call is sketched after this list.

  6. Tag Me matches terms in a given text with Wikipedia, i.e. it links the text to recognised named entities. Furthermore, it uses the in-link graph and the page dataset to disambiguate recognised entities to their Wikipedia URLs [6]. Tag Me is open source, and its REST API endpointFootnote 6 is available for further (re-)use.

  7. Other APIs: Besides the openly available components, there are many commercial APIs that also provide open access for the research community. We have used such APIs for the NER and NED tasks. The Aylien APIFootnote 7 is one such API; it uses natural language processing and machine learning for text analysis, and its text analysis module also includes entity spotting and disambiguation. TextRazorFootnote 8, DandelionFootnote 9, OntotextFootnote 10 [13], AmbiverseFootnote 11, and MeaningCloudFootnote 12 are further APIs that we have used for the NER and NED tasks.
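To illustrate how such services are typically invoked before being wrapped as Qanary components, below is a minimal sketch of a call to the public DBpedia Spotlight endpoint; the URL and parameters reflect the public API at the time of writing and may change, so treat this as an illustrative assumption rather than part of the Frankenstein interface.

    # Sketch: annotate the running example question with DBpedia Spotlight.
    curl -G "https://api.dbpedia-spotlight.org/en/annotate" \
         --data-urlencode "text=Who is the mayor of Berlin?" \
         --data-urlencode "confidence=0.5" \
         -H "Accept: application/json"

The JSON response lists the recognised surface forms together with their disambiguated DBpedia resources (e.g. the resource for Berlin), which a wrapper then re-expresses as qa annotations.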

Relation Linking Components. The Relation Linking (RL) task aims to disambiguate the natural language (NL) relations present in a question to their corresponding mentions in a knowledge base (KB). Considering our example question “Who is the mayor of Berlin?”, a relation linking component would correctly link the text “mayor of” to dbo:leaderFootnote 13. For RL, we rely on the following open-source components:

  1. ReMatch maps natural language relations to knowledge graph properties by using dependency parsing characteristics with adjustment rules [18]. It then carries out a match against knowledge base properties, enhanced with the WordNet lexicon, via a set of similarity measures. It is an open-source tool, and the code is available for reuse as a RESTful serviceFootnote 14.

  2. RelationMatcher: This component [24] devises a semantic-index-based representation of PATTY [19] (a knowledge corpus of linguistic patterns and their associated DBpedia properties) and a search mechanism over this index with the purpose of enhancing the relation linking task. We call this component RelationMatcher in Frankenstein. The main ideas of this component are to (1) improve linguistic similarity with cosine similarity by converting PATTY into a vector space, and (2) address the problem of PATTY patterns not being uniform by introducing a penalty function.

  3. RelMatch: The disambiguation module (DM) of the OKBQA framework [12] provides disambiguation of entities, classes, and relations present in a natural language question. This module is a combination of AGDISTIS and the disambiguation module of the AutoSPARQL project [27]. The DM module is an independent component in the OKBQA framework and is available for reuseFootnote 15. We name this component RelMatch (the “OKBQA relation disambiguator”).

  4. RNLIWOD: The Natural Language Interfaces for the Web of Data (NLIWOD) community groupFootnote 16 provides reusable components for enhancing the performance of QA systems. We utilise one of its components to build a similar relation linking componentFootnote 17, which we call the “RNLIWOD relation linker”.

  5. Spot Property: This component combines RNLIWOD and the OKBQA disambiguation module [12] for the relation linking task. We call this component Spot Property.

Components for Class Linking. To correctly generate a SPARQL query for an NL question, it is also necessary to disambiguate classes against the ontology.Footnote 18 For example, considering the question “Which river flows through Seoul?”, the word “river” needs to be mapped to dbo:RiverFootnote 19. In Frankenstein, we deployed two components for this task:

  1. NLIWOD CLS: The NLIWOD Class Identifier is one of several tools provided by the NLIWOD community for reuse. The code for the class identifier is available on GitHub (see footnote 17).

  2. OKBQA Class Identifier: This component is part of the OKBQA disambiguation module (see footnote 15). We reused it for the specific task of class linking.

Components for Query Building. A query builder generates SPARQL queries using the disambiguated entities, relations, and classes provided as input by the previous steps of the QA pipeline (a conceptual sketch is given after this list). We have used two components for this task:

  1. NLIWOD QB: Template-based query builders are widely used in the QA community for SPARQL query construction (e.g. HAWK [29], TBSL [27]). We built a template-based SPARQL query construction component based on the NLIWOD reusable resources (see footnote 17).

  2. SINA Query Builder: SINA is a keyword and natural language query search engine based on Hidden Markov Models for choosing the correct dataset to query [21]. We decoupled the existing SINA implementation to use its query builder as an independent component.
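As a conceptual illustration of template-based query building, the sketch below instantiates a hypothetical template with the disambiguated inputs of our running example (res:Berlin from the NED step, dbo:leader from the RL step); the concrete templates and slot names used by NLIWOD QB may differ.

    # Hypothetical template with an entity slot and a relation slot:
    #   SELECT DISTINCT ?uri WHERE { <ENTITY> <RELATION> ?uri . }
    # Instantiated with the outputs of the NED and RL steps:
    SELECT DISTINCT ?uri
    WHERE {
      <http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/leader> ?uri .
    }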

The complete list of components, together with their expected inputs and outputs, can be found in our public GitHub repositoryFootnote 20.

4 Approach for Building Reusable QA Components Within Frankenstein

Figure 1 represents the resource-wise (module-wise) architecture of Frankenstein. It is decoupled into two independent categories: (i) reusable QA component wrappers, which provide an individual wrapper for each component, and (ii) evaluators, which provide an individual runner and evaluator for every integrated component. In the following, these two sets of resources are described in more detail.

Fig. 1. Modules of Frankenstein: (i) reusable QA component wrappers and (ii) evaluators.

4.1 Integration Approach and Its Challenges

Here, we present the integration approach and the challenges associated with integrating components accomplishing the NER, NED, RL, CL, and QB tasks using the Qanary methodology as applied in Frankenstein.

Employing Qanary Methodology and Vocabulary. Qanary follows a microservice-based architecture where all components are accessible as RESTful services [3] that can be integrated into a Qanary QA process. A QA process within Qanary is a knowledge-driven process where the input/output concerning the question, the answer, and the annotations generated in the different steps of the QA pipeline is conceptualised and annotated using the qa vocabulary. Each component integrated into a QA pipeline populates a local knowledge graph (typically, its output is annotated with the qa vocabulary) that is shared with the other components within Qanary. This establishes standardised communication between the components and forms the foundation for exchangeability and composability.

In order to be able to annotate the outputs generated by all the QA tasks, we had to extend the original version of the qa vocabulary [22] by adding new concepts for the RL, CL, and QB tasks, while reusing the NER and NED annotations from [4]. E.g., to describe relations appearing in the natural language question, we introduce the annotation qa:AnnotationOfRelation:

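A minimal Turtle sketch of such a concept definition follows; the qa namespace URI and the superclass qa:AnnotationOfQuestion are assumptions based on [22], while only the name qa:AnnotationOfRelation is fixed by the text above.

    @prefix qa:   <http://www.wdaqua.eu/qa#> .        # namespace assumed from [22]
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # New concept for annotating relations detected in a question:
    qa:AnnotationOfRelation a owl:Class ;
        rdfs:subClassOf qa:AnnotationOfQuestion .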

For instance, for the given question “Who is the mayor of Berlin?”, the annotated output of the RNLIWOD component for the RL task is shown belowFootnote 21. There, the output, i.e. http://dbpedia.org/ontology/leaderName, is annotated using the qa vocabulary. Moreover, as part of the qa extension, we also introduce the further annotations qa:AnnotationOfClass and qa:AnnotationOfAnswerSPARQL for the CL and QB tasks, and we reuse qa:AnnotationOfSpotInstance and qa:AnnotationOfInstance for the NER and NED tasks from [4].

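A hedged Turtle sketch of this annotation is given below, following the Web Annotation (oa) pattern used by the qa vocabulary; the annotation, question, and component URIs are placeholders rather than identifiers generated by a real Qanary process.

    @prefix oa:  <http://www.w3.org/ns/oa#> .
    @prefix qa:  <http://www.wdaqua.eu/qa#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Placeholder URIs for illustration only.
    <urn:anno:1> a qa:AnnotationOfRelation ;
        oa:hasTarget   <urn:question:1> ;   # “Who is the mayor of Berlin?”
        oa:hasBody     <http://dbpedia.org/ontology/leaderName> ;
        oa:annotatedBy <urn:component:RNLIWOD> ;
        oa:annotatedAt "2018-03-01T00:00:00Z"^^xsd:dateTime .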

Alignment of QA Component Annotations. To ensure the interoperability of a new component with existing ones, it has to express the semantics of its generated information using the qa vocabulary. We call this the alignment of the component to qa. There are at least three ways to align the knowledge of a component about the given question. (1) SPARQL queries: a component can execute SPARQL INSERT queries on the knowledge base to generate new annotations expressed using the qa vocabulary (a sketch follows below). (2) OWL axioms: when a component already generates information in a specific vocabulary, like the NIF vocabulary used by DBpedia Spotlight [15], OWL axioms can express the semantic relation between the specific vocabulary and the qa vocabulary (e.g. by defining owl:sameAs rules). (3) Distributed Ontology Language (DOL): it enables heterogeneous combinations of ontologies written in different languages and logics [17]. We presented alignments of some existing components using the Qanary approach in [3] and reused similar alignments in this paper.
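For option (1), a component’s alignment could look like the following sketch; the graph URI and question URI are placeholders for the identifiers that a concrete Qanary process provides at runtime.

    PREFIX oa: <http://www.w3.org/ns/oa#>
    PREFIX qa: <http://www.wdaqua.eu/qa#>

    # Insert a new relation annotation into the shared knowledge graph
    # of the current QA process (placeholder URIs).
    INSERT {
      GRAPH <urn:graph:qa-process> {
        _:a a qa:AnnotationOfRelation ;
            oa:hasTarget <urn:question:1> ;
            oa:hasBody   <http://dbpedia.org/ontology/leaderName> .
      }
    }
    WHERE {}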

Wrapping Components and Challenges. During the development of the Qanary wrappers for the different components in Frankenstein, we encountered several challenges. The first challenge was to deal with interoperability issues among the components. For instance, a number of components were available as RESTful services, while a few [18, 24] only provided open source code; thus, we implemented a RESTful service on top of their source code to make them easily reusable. The second challenge was associated with the heterogeneous output formats of the components: some just provide output in JSON (e.g. [6, 16]), while some provide output in their own specific vocabulary (e.g. Ontotext [13]). A more challenging case was decoupling SINA from its monolithic implementation, which required changing the complete package structure, dependencies, input format, etc. of the original code to make it reusable.

4.2 Integrating Evaluation Module

Another set of valuable resources in Frankenstein is its evaluation modules. These modules comprise three parts: (1) benchmark creation, (2) pipeline configuration, and (3) evaluators. They are briefly described below:

Creating Benchmarks for QA Tasks. We follow the methodology provided in [4, 24] to create benchmarks for each individual stage of the QA pipeline. For our running example “Who is the mayor of Berlin?”, the corresponding SPARQL query in QALD-5Footnote 22 is shown below.

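In line with the gold-standard entities and relations discussed below (res:Berlin for NED, dbo:leader for RL), the QALD-5 query is essentially the following sketch (prefix declarations added here for self-containment):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>

    SELECT DISTINCT ?uri
    WHERE { res:Berlin dbo:leader ?uri . }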

For the NED and RL tasks, our modules compare the named entities and relations detected by the components with the entities and relations mentioned in the corresponding SPARQL query (e.g. res:Berlin for NED and dbo:leader for RL). For class linking (CL) components, a similar approach is applied when questions contain class references. To assess the performance of QB components, we run the generated SPARQL query as well as the benchmark SPARQL query and then compare the returned answers; for our running question, a query builder’s performance is thus evaluated by comparing the answers of its generated query with those of the benchmark query. For evaluating the complete pipeline, the answers of the pipeline can be compared with the gold standard answers. In the future, we plan to provide a simple configuration that directs the SPARQL results to GERBIL [31], an evaluation platform for complete QA processes.

Pipeline Configuration and Runner. To ease the process of composing pipelines, we have automatised the whole process of configuring and running them using Bash scripts. Based on the required task, users can choose the components, update the script, and automatically run the pipeline in three different modes: (1) Frankenstein static, (2) Frankenstein dynamic, and (3) Frankenstein improved [25]. Sample Bash commands are sketched below.

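A sketch of the two commands follows; the script name and flags are illustrative assumptions, not the actual interface of the published scripts.

    # (a) run a single component (stanfordNER) over a file of questions
    bash run_components.sh --components "stanfordNER" --questions questions.csv

    # (b) run two components (Babelfy and AGDISTIS) simultaneously
    bash run_components.sh --components "Babelfy,AGDISTIS" --questions questions.csv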

The first command, i.e. (a), runs a single component, i.e. stanfordNER, while the second command runs the two components Babelfy and AGDISTIS simultaneously. These scripts are very useful when the user wants to evaluate thousands of questions in bulk. In addition, Frankenstein provides the in-built UI from Qanary for executing a pipeline on a single input question.

Pipeline Execution. We implemented an independent module called LC-Evaluator within Frankenstein for executing pipelines. This module automatically evaluates every individual component of a pipeline. It obtains the questions from a text file (CSV and TXT formats are supported). A user can run a single component or a pipeline containing multiple components; the pipeline executor automatically passes the questions sequentially to the associated components. Relying on the Qanary methodology, the outputs of the components (annotations) are stored in a knowledge graph (i.e. Stardog v4.1.3Footnote 23). The pipeline executor then reads the annotations of a particular question from the triplestore and creates an independent file in the Turtle format (TTL) for the given input question, labelled “questionID_component.ttl”. This process is efficient in the case of a large number of questions or texts: the user can upload the text file containing the questions, execute the LC-Evaluator component, and all the output is automatically generated as one .ttl file per question.

Pipeline Evaluator. We developed individual benchmarks for each step of the QA pipeline used in the evaluation module. Since LC-QuAD [26] and QALD-5 [28] are currently the most popular and largest state-of-the-art gold standards, we derived individual benchmarks from them for each separate QA task. In the future, we plan to provide additional benchmark files (e.g. for other QALD series); for the NED task and for full pipeline evaluation, we plan to integrate the pipeline evaluator with the GERBIL platform [32]. Using these benchmarks, users can evaluate the performance of their components for any step of the QA pipeline w.r.t. the other QA components in Frankenstein.

5 Availability and Sustainability

In this section, we describe the accessibility of the resources and our plan for their sustainability. We have published the source code of all the reusable components integrated in Frankenstein at our public GitHub repositoryFootnote 24 under GPL 3.0Footnote 25. The GitHub repository includes the following:

  • The wrappers of the reusable QA components, in the Qanary folder, together with detailed instructions on how to install and use these components integrated using Qanary.

  • The evaluation scripts and detailed instructions for using them, in the Evaluation folder.

  • A complete list of the 29 integrated components, with their inputs, outputs, and API restrictions.

Regarding sustainability, the resources are currently maintained by the WDAqua projectFootnote 26. Once the project ends in December 2018, the repository will be transferred to AskNowFootnote 27, an initiative to bring all question answering tools and techniques together under a single repository.

6 Related Work

A large number of QA systems have been developed in the last years. This can be inferred from the number of QA systems (>38 in the last 5 years) that were evaluated against QALDFootnote 28, a popular benchmark for QA systems. Unfortunately, most QA systems follow a monolithic approach, not only at the implementation level but also conceptually. Hence, there is limited reusability for further research, and as a consequence creating new QA systems is cumbersome and inefficient. On the other hand, many QA systems reuse existing components. For example, services for named entity identification (NEI) and disambiguation (NED), like DBpedia Spotlight [15] and AIDA [11], are already reused across several QA systems. We are aware of at least three frameworks besides Frankenstein attempting to provide a reusable architecture for QA systems. QALL-ME [7] provides a reusable architecture skeleton for building multilingual QA systems; its main disadvantage is that it proposes a fixed pipeline that cannot be changed. openQA [14] allows combining multiple QA systems; its main downside is that it does not offer modularisation of QA systems and requires the QA systems to be implemented in Java using the provided interfaces. OKBQA [12] is a recent and effective attempt to develop question answering systems through a collaborative effort. OKBQA has 24 components, whereas Frankenstein has 29. The limitation of OKBQA is that it divides the components into four fixed tasks, namely template generation, disambiguation, query generation, and answer generation, and follows a strict input/output format. In Frankenstein and its resources, by contrast, the number of tasks is not fixed: when a new component performing a new task needs to be integrated into Frankenstein, this can be done just by extending the qa vocabulary, which is not possible in OKBQA. Frankenstein is built on top of the Qanary ecosystem, which also provides entity annotation benchmarking and benchmarks 6 QA components [4]. Combining the efforts of Frankenstein, the Qanary ecosystem, and OKBQA will benefit the QA community; we have already reused many independently available QA components from the OKBQA repositoryFootnote 29 and plan to provide Frankenstein components for OKBQA.

Besides the QA frameworks, evaluation frameworks like GERBIL [32] have emerged over the last years. GERBIL provides means to benchmark several QA systems on multiple datasets in a comparable and repeatable way, fostering the open-science methodology. Using GERBIL, many entity disambiguation components can be evaluated on different datasets. However, GERBIL does not provide support for stages of the QA pipeline other than entity annotation and linking. Very recently, the framework has been further extended towards benchmarking complete QA pipelines [31]. We plan to integrate some of the reusable components of Frankenstein, along with the possibility to benchmark complete QA pipelines, into GERBIL.

Beyond the question answering community, the reusability of components has long been a trend in software engineering [2, 10, 20]. For example, Rainbow is a popular framework that uses a reusable infrastructure to support the self-adaptation of software systems [9]. Apache UIMAFootnote 30 is another open-source project that provides a reusable framework, tools, and annotators for building software systems for knowledge extraction and information analysis.

7 Conclusion and Future Work

In this paper, we decoupled the Frankenstein architecture and presented its reusable resources as an extension of Qanary. Frankenstein is dedicated to extending the Qanary ecosystem through the following contributions:

  • It contributes a large set of new components to the ecosystem of reusable components initiated by Qanary. Consequently, researchers and practitioners are now enabled to create a large set of different QA systems out of the box, thanks to the composability features inherited from Qanary. We calculated that, only using the components directly provided by Frankenstein/Qanary, 380 different ready-to-use QA pipelines can be created with a small investment of time.

  • It provides additional evaluators and intermediate data representations, as well as corresponding tools, which enable researchers and practitioners to evaluate the tasks within the current QA process. Using these new data sources, it is possible to establish quality-improving operations on particular QA components or on the whole QA process, aiming at improving the QA process for particular fields of applicability. Our recent publication (cf. [25]) already demonstrated the possible impact, and the additional potential for the QA community is significant.

Hence, by using the resources provided by Frankenstein, the efficiency of building and evaluating QA systems increases significantly for academia and industry, which might lead to a boost of new research results.

Frankenstein contributes to our broader research agenda of offering the QA community an efficient way of applying its research in a field that is driven by many different disciplines and consequently requires a collaborative approach to achieve significant progress.

In the future, we will add even more resources to Frankenstein and its environment, in particular data from evaluations on top of well-known benchmarks (e.g. the QALD series), leading to new insights into the strengths and flaws of the QA components available in the community, as well as machine learning tools enabling the optimisation of QA systems with less manual effort.