1 Introduction

The retrieval of information is an important step in the process of increasing knowledge. Retrieving information on the same topic from different sources is essential to broaden the base of information for analyses. Scientific analyses in particular benefit from considering distributed information from different disciplines, which brings additional aspects into view. However, the retrieval of disciplinary and interdisciplinary data is often technically limited and has to be done manually by the scientists.

Methods to retrieve information from different data sources have been developed for many years. They range from Data Warehouses [1], which gather all information in a single database, to Federated Information Systems [2, 3], which allow interoperability and information sharing between decentrally organized information systems, and Mediator-based systems [4,5,6], which allow the retrieval of heterogeneous data from distributed sources.

All these systems have in common that the data sources have to be managed by a central managing instance that must know, and be able to access, the distributed data sources. These are suitable approaches for data sources within one corporation or institution. However, granting an administrator from a different company or institution access to one's own data source is not a viable approach, for example for reasons of privacy.

Therefore, as a contribution to solving these problems, we introduce two different information systems:

  1. The Reverse-Mediated Information System (ReMIS) architecture is motivated by the need to search distributed data together with further context from multiple heterogeneous sources. ReMIS uses dynamic joins of results from heterogeneous data formats. It is based on the concept of common Mediator-based systems, but has been modified to grant data owners full control over their data and to let them provide their data without the need for a central administrator. ReMIS is mainly designed to collate data from different sub-disciplines of one field.

  2. The extended architecture ReMIS Cloud enables information retrieval by linking different distributed data sources together to find inter-domain knowledge through an intuitive search interface. The architecture uses a category-based data source registration to connect differently shaped data types and formats. This novel way of information retrieval enables scientists to cross-connect information with domain-extrinsic but coherent knowledge. ReMIS Cloud is intended to provide a central platform where all disciplines can connect and contribute data to the network, subject to the requirements of the ReMIS architecture.

In this paper, we first describe ReMIS in Sect. 2 and ReMIS Cloud in Sect. 3. Then, we discuss the field of application of the two mentioned information systems and highlight the differences and similarities of the architectures in Sect. 4. Finally, we show the benefit of the two systems with an example from the archaeo-related sciences, for which both information systems are reasonably applicable, in Sect. 5.

2 Reverse-Mediated Information System

In this section, we introduce the Reverse-Mediated Information System (ReMIS) [7, 8] that is designed to connect heterogeneous information from distributed, anonymous databases.

Architecture: The basic concept of ReMIS builds upon the well-known Mediator-based system [4,5,6]. In contrast to that system, however, the connected data sources are not configured by a central managing instance. Instead, it is up to the administrators of the individual data sources to connect their data to the network. They therefore retain full control over their data, and no central administrator has insight into both the data structure and the data.

To connect a data source to the network, the data owners have to run the Connector Application on their server. There, they have to assign a minimum set of parameters, called “Minimal Search Parameters” (MSP), each of which has to be mapped to the corresponding column in the data source. The actual parameters of the MSP differ from domain to domain. The MSP is necessary for the search, to identify the different sets of entries, and to guarantee that the connected database contains the searchable parameters. For example, in Bavaria, Germany, the MSP in the archaeo-related domains consists of the excavation number and the find label number (a unique number for all objects found at the same location). These parameters are available in all archaeo-related databases, independent of the underlying sub-domain.

Furthermore, the Connector Application provides privacy settings, so that the data owners decide which subsets of the data are searchable and which remain private. There are two options. First, the owners can define which columns are returned, and thereby which are not. Second, they can specify conditions that a data set has to satisfy to be included in the result of a query. This makes it possible to exclude columns with sensitive information as well as to hide data sets that shall not be made public.
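The two privacy options can be illustrated with a minimal Python sketch. It is not the Connector Application's actual implementation: the function name `apply_privacy_settings`, the column names, and the toy rows are assumptions made for illustration only.

```python
from typing import Any, Callable

Row = dict[str, Any]


def apply_privacy_settings(
    rows: list[Row],
    visible_columns: set[str],
    conditions: list[Callable[[Row], bool]],
) -> list[Row]:
    """Return only releasable data sets, reduced to the releasable columns."""
    released = []
    for row in rows:
        # Option 2: a data set is included only if every condition holds.
        # Conditions see the full row, including columns that stay private.
        if all(condition(row) for condition in conditions):
            # Option 1: columns not declared as visible are dropped.
            released.append({k: v for k, v in row.items() if k in visible_columns})
    return released


# Invented toy data; "finder" and "published" are hypothetical columns.
rows = [
    {"excavation_no": "E-1", "find_no": 7, "finder": "A. Person", "published": True},
    {"excavation_no": "E-1", "find_no": 8, "finder": "B. Person", "published": False},
]
result = apply_privacy_settings(
    rows,
    visible_columns={"excavation_no", "find_no"},
    conditions=[lambda r: r["published"]],
)
print(result)
```

In this sketch the conditions are evaluated before the columns are removed, so a private column such as `published` can still decide whether a data set is released.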

Once all parameters of the MSP are assigned, the data source can be registered with the Server Application, which runs on a central platform. The Server Application stores the connections to all connected data sources, without storing any actual data. At that point, the data source is configured and registered with the system.
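The registration step might be sketched as follows. The class and method names, the endpoint URLs, and the local column names are hypothetical. The text does not say where the completeness check runs, so placing it in the connector is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ServerApplication:
    """Central registry: one connection per data source, never any records."""
    msp: tuple[str, ...]
    connections: dict[str, str] = field(default_factory=dict)  # source id -> endpoint

    def add_connection(self, source_id: str, endpoint: str) -> None:
        self.connections[source_id] = endpoint


@dataclass
class ConnectorApplication:
    """Runs at the data owner's site; the column mapping never leaves it."""
    source_id: str
    endpoint: str
    mapping: dict[str, str] = field(default_factory=dict)  # MSP parameter -> local column

    def register_with(self, server: ServerApplication) -> None:
        # Assumption: the connector itself verifies that the MSP is fully mapped.
        missing = [p for p in server.msp if p not in self.mapping]
        if missing:
            raise ValueError(f"unmapped MSP parameters: {missing}")
        server.add_connection(self.source_id, self.endpoint)


server = ServerApplication(msp=("excavation_number", "find_label_number"))

complete = ConnectorApplication(
    "anthropology",
    "https://collection.example/connector",  # hypothetical endpoint
    {"excavation_number": "exc_no", "find_label_number": "label"},
)
complete.register_with(server)

incomplete = ConnectorApplication(
    "incomplete", "https://other.example/connector", {"excavation_number": "dig"}
)
rejected = ""
try:
    incomplete.register_with(server)
except ValueError as err:
    rejected = str(err)
```

Only the source identifier and the endpoint reach the server in this sketch, which mirrors the statement that the Server Application stores connections but no actual data.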

In summary, the main difference between the well-known Mediator-based system and ReMIS is that Mediator-based systems require a central administrator to manage and connect the data sources and to mediate the requests from the users, as sketched in Fig. 1. The administrator always has to know each data source to be able to connect it. In contrast, ReMIS is designed to allow the data owners to register their data with the system and manage it on their own, as sketched in Fig. 2. The required mediation set-up is carried out through a wizard dialogue. The architecture forwards the user requests to the data sources, where each request is mediated.

Fig. 1. Sketch of the well-known Mediator-based architecture [8].

Fig. 2. Sketch of the Reverse-Mediated Information System (ReMIS) [8].

Data Retrieval: For data retrieval, a form (either on a website or embedded in an application) lets the user retrieve domain-specific information from all data sources that are connected to ReMIS. The form requests the domain-specific MSP from the Server Application so that it can display a search field for each parameter of the MSP. The user then enters a search term for each of these parameters.

The search request of the user is first sent to the Server Application, which forwards it to all connected data sources. There, the Connector Applications translate (“mediate”) all parameters of the MSP to their local data scheme. Then, the local data source is queried for the data sets that match the user input; the defined privacy settings are also taken into account in this query. The result is then sent back to the user via the Server Application. The retrieved data is displayed to the user in a table view and can also be exported to CSV or spreadsheet files.
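The request flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: in-memory lists stand in for the local databases, privacy settings are omitted, and renaming the MSP columns back to the shared vocabulary in the result is an assumption that the text does not confirm. All class, column, and value names are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


class ConnectorApplication:
    def __init__(self, mapping: dict[str, str], rows: list[Row]) -> None:
        self.mapping = mapping  # MSP parameter -> local column name
        self.rows = rows        # stands in for the local data source

    def search(self, request: Row) -> list[Row]:
        # Mediation: translate the MSP parameters to the local data scheme.
        local_query = {self.mapping[p]: v for p, v in request.items()}
        hits = [
            row for row in self.rows
            if all(row.get(col) == val for col, val in local_query.items())
        ]
        # Assumption: MSP columns are renamed back so that results line up.
        inverse = {col: p for p, col in self.mapping.items()}
        return [{inverse.get(c, c): v for c, v in row.items()} for row in hits]


class ServerApplication:
    def __init__(self, connectors: list[ConnectorApplication]) -> None:
        self.connectors = connectors  # connections only, no data

    def search(self, request: Row) -> list[Row]:
        # Forward the request to every connected source and collect the results.
        results: list[Row] = []
        for connector in self.connectors:
            results.extend(connector.search(request))
        return results


anthropology = ConnectorApplication(
    {"excavation_number": "exc_no", "find_label_number": "label"},
    [{"exc_no": "M-2001", "label": 12, "sex": "f"},
     {"exc_no": "M-2002", "label": 3, "sex": "m"}],
)
archaeology = ConnectorApplication(
    {"excavation_number": "Grabung", "find_label_number": "Fundzettel"},
    [{"Grabung": "M-2001", "Fundzettel": 12, "object": "fibula"}],
)
server = ServerApplication([anthropology, archaeology])
found = server.search({"excavation_number": "M-2001"})
```

Each connector knows only its own mapping, so the server never needs to learn the local column names of any data source.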

3 ReMIS Cloud

The decentralized architecture of ReMIS is well suited to distributed but similarly structured data within one discipline. However, especially in the sciences, the consideration of domain-extrinsic data is essential. Interdisciplinary data often exists for specific research areas, but it is cumbersome and time-consuming to retrieve the coherent data sets. Therefore, we designed and implemented ReMIS Cloud [9], an architecture that allows data sources from different disciplines to be registered and connected, so that users can search for interdisciplinary but related data sets.

Architecture: The architecture of ReMIS Cloud, which is sketched in Fig. 3, is based on the concepts of ReMIS.

Fig. 3. An abstract sketch of the ReMIS Cloud architecture [9].

The central aspect of the architecture is the management of “Categories”, which describe (parts of) the content of the data sources. Each Category defines a minimum set of parameters, called the “Category Information Definition” (CID), which is equivalent to the MSP of ReMIS. Each data source that is assigned to a Category must map all parameters of the CID to guarantee that all search parameters are available. Furthermore, tags can be assigned to each Category. In essence, a Category corresponds to the structure of the Server Application of ReMIS.

All available Categories are managed on a platform on a central server. In principle, any number of Categories can be registered there. Validating Categories is advisable to avoid duplicate or meaningless Categories.

The initialization of the connection between the data sources and the Categories is largely identical to the initialization between the Connector Application and the Server Application in ReMIS. However, there are two fundamental differences. First, a data source is not limited to a single Category. It can be connected to any number of Categories, provided that the assignment of the CID parameters to the data makes contextual sense. Second, each data source uses tags, which can additionally be assigned in the Connector Application. Furthermore, each data source implicitly uses the tags of the corresponding Category.
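The relationship between Categories, CIDs, and tags might be modelled as in the following Python sketch. The class names, the CID parameters, and the tag values are illustrative assumptions, not the data model of the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Category:
    name: str
    cid: tuple[str, ...]               # parameters every assigned source must map
    tags: frozenset[str] = frozenset()


@dataclass
class DataSource:
    name: str
    own_tags: set[str] = field(default_factory=set)  # set in the Connector Application
    categories: list[Category] = field(default_factory=list)
    mappings: dict[str, dict[str, str]] = field(default_factory=dict)

    def assign(self, category: Category, mapping: dict[str, str]) -> None:
        # A source may join any number of Categories, but each CID must be
        # mapped completely (CID parameter -> local column name).
        missing = [p for p in category.cid if p not in mapping]
        if missing:
            raise ValueError(f"{category.name}: unmapped CID parameters {missing}")
        self.categories.append(category)
        self.mappings[category.name] = mapping

    def effective_tags(self) -> set[str]:
        # Own tags plus the tags inherited implicitly from every Category.
        tags = set(self.own_tags)
        for category in self.categories:
            tags |= category.tags
        return tags


# Hypothetical Categories and CIDs for illustration.
archaeo = Category("Archaeo", ("excavation_number", "find_sheet_number"),
                   frozenset({"archaeology"}))
location = Category("Location", ("latitude", "longitude"), frozenset({"geo"}))

source = DataSource("anthropology_db", own_tags={"anthropology"})
source.assign(archaeo, {"excavation_number": "exc_no", "find_sheet_number": "sheet"})
source.assign(location, {"latitude": "lat", "longitude": "lon"})
tags = source.effective_tags()
```

Keeping one mapping per Category reflects the point that the same data source can expose different columns under different CIDs.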

Data Retrieval: In step A of the data search, from a list of Categories C, the user can select all Categories \(C^A \subseteq C\) that should be searched. For the selected Categories the corresponding CIDs are displayed in a search mask where the user can enter the search parameters \(P_{C^A}\) for the search request. Once the request is submitted, the central server forwards it to all data sources \(DS_{C^A}\) that are assigned to the selected Categories \(C^A\). There, the Connector Applications translate all parameters of the query to their local data scheme. Considering the defined privacy settings, the result \(R_{P_{C^A}}\) is retrieved from the data sources and is sent back via the central server to the user.

For step B, the user can select additional Categories \(C^B \subseteq C \backslash C^A\) to achieve the retrieval of domain-extrinsic, but content-related information. The values \(P_{C^B}\) for the search parameters of the CIDs of the Categories \(C^B\) are read out of the result \(R_{P_{C^A}}\). A further search request is executed over all data sources \(DS_{C^B}\) that are assigned to the selected Categories \(C^B\). The result \(R_{P_{C^B}}\) is then also sent back to the user.

Finally, the retrieved data \(R := R_{P_{C^A}} \cup R_{P_{C^B}}\) is displayed in table view to the user. In step C, the users can filter the result R with the available tags to be able to only display the information of interest. The final result \(R_{filtered} \subseteq R\) can be exported to CSV or spreadsheet files.
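Steps A to C can be traced in a short Python sketch. It assumes that rows are already expressed in CID parameter names (mediation and privacy settings are left out) and that a step-A row contributes a step-B request only if it contains every parameter of the selected Category's CID. The CIDs, source names, and values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any

Row = dict[str, Any]

# Hypothetical CIDs; the parameter names are illustrative only.
CIDS: dict[str, tuple[str, ...]] = {
    "Archaeo": ("excavation_number",),
    "Location": ("latitude", "longitude"),
}


@dataclass
class Source:
    name: str
    categories: set[str]
    tags: set[str]
    rows: list[Row]  # assumed to use CID parameter names already


def query(sources: list[Source], selected: set[str], params: Row) -> list[Row]:
    """Forward one request to every source assigned to a selected Category."""
    hits: list[Row] = []
    for source in sources:
        if source.categories & selected:
            for row in source.rows:
                if all(row.get(k) == v for k, v in params.items()):
                    hits.append({**row, "_source": source.name, "_tags": source.tags})
    return hits


def two_step_search(sources: list[Source], cats_a: set[str], params_a: Row,
                    cats_b: set[str]) -> list[Row]:
    if cats_a & cats_b:
        raise ValueError("C^B must be chosen from C without C^A")
    result_a = query(sources, cats_a, params_a)  # step A
    result = list(result_a)
    for category in cats_b:                      # step B
        cid = CIDS[category]
        for row in result_a:
            if all(p in row for p in cid):
                params_b = {p: row[p] for p in cid}  # values read out of R_{P_{C^A}}
                for hit in query(sources, {category}, params_b):
                    if hit not in result:        # union without duplicates
                        result.append(hit)
    return result


def filter_by_tags(result: list[Row], tags: set[str]) -> list[Row]:
    return [row for row in result if row["_tags"] & tags]  # step C


dig = Source("dig_db", {"Archaeo", "Location"}, {"archaeology"}, [
    {"excavation_number": "M-2001", "latitude": 48.1, "longitude": 11.6, "object": "fibula"},
    {"excavation_number": "M-2002", "latitude": 49.0, "longitude": 12.1, "object": "urn"},
])
climate = Source("climate_db", {"Location", "Climate"}, {"climate"}, [
    {"latitude": 48.1, "longitude": 11.6, "mean_temp_c": 9.7},
    {"latitude": 49.0, "longitude": 12.1, "mean_temp_c": 8.9},
])

result = two_step_search([dig, climate], {"Archaeo"},
                         {"excavation_number": "M-2001"}, {"Location"})
climate_only = filter_by_tags(result, {"climate"})
```

Because `dig_db` is itself assigned to “Location”, step B finds its row a second time; the duplicate check keeps the combined result a proper union.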

4 Comparison of ReMIS and ReMIS Cloud

While the two presented information systems, ReMIS and ReMIS Cloud, have much in common, they differ in some key elements. Both systems enable the retrieval of heterogeneous data from distributed data sources. The most important difference is their scope. While ReMIS is aimed only at the retrieval of data from one well-defined discipline, ReMIS Cloud enables an interdisciplinary collation of data, including data from different scientific disciplines, knowledge information, inventory information, etc.

The two systems therefore require different infrastructures. ReMIS is intended to be run as a separate instance for each supported discipline. In comparison, ReMIS Cloud is designed as a central platform to which all Categories for the different disciplines can be added to enable interdisciplinary data retrieval. However, it is also conceivable to use a dedicated instance of ReMIS Cloud for one field of application that has a large number of sub-disciplines.

Neither system requires a central administrator who manages all data sources. However, administration of the Categories is recommended for ReMIS Cloud to increase the quality of the available Categories. Data owners keep full control over their data: rights management is handled in the Connector Applications, that is, locally where the data sources are located. Only authorized data is transmitted to the information systems. It is the data owner who decides in the privacy settings which data can be searched and retrieved.

In brief, ReMIS and ReMIS Cloud are designed for different scopes of application. However, a data source can be connected to both systems, which are able to run in parallel on the same data source.

5 Use Case: Archaeo-Related Sciences

To demonstrate the application area in which ReMIS and ReMIS Cloud could prove useful, we want to present two use cases from the archaeo-related areas.

Retrieval of Distributed Information of an Excavation: After the excavation, the corpus of excavated findings (remains of buildings, artifacts, human burial remains, or faunal remains) is transferred to specialized (e.g. zooarchaeological, anthropological, archaeological, or archaeobotanical) organisations, where the findings are further analyzed and then archived or exhibited.

To analyze the find circumstances, it can be important to also consider artifacts or remains of different categories that were found at the same location. In a grave, for example, not only the human remains are of interest, but also the grave goods found next to the human body, which are needed to fully understand the historical context of the grave. However, since the data on the different types of findings is stored in a distributed manner in the databases of the different collections, retrieving this information requires the anthropologist to contact other scientists who have access to those databases and to obtain the information from them.

With ReMIS, all the different archaeo-related databases can be connected using the Minimal Find Sheet information as the MSP, that is, the excavation number and the find sheet number. In Bavaria, Germany, these are assigned to the findings by the Bavarian State Department of Monuments and Sites, and the findings are then sent to the specialized collections, such as the Bavarian State Collection for Anthropology and Palaeoanatomy Munich or the Bavarian State Archaeological Collection Munich. Therefore, it is guaranteed that the Minimal Find Sheet information is stored in all connected data sources. A scientist who is interested in collating related information from the different disciplines can use ReMIS either to query all connected data sources for a specific excavation (by searching for a specific excavation number) or to further restrict the findings with the find sheet number.

Retrieval of Excavation-Related, Interdisciplinary Information: With ReMIS, distributed data from excavations can be collated. However, considering information from other disciplines is also important. For example, other data sources contain climate information for different places and are registered to the Categories “Location” (for geographic coordinates) and “Climate” (for climate information). If the databases from the archaeo-related domain also contain geo data and have therefore also been registered to the Category “Location”, this information can easily be retrieved with ReMIS Cloud.

To do so, the Category “Archaeo” is selected first to retrieve the same data as in the first example with ReMIS. Then, in the second step, selecting the additional Category “Location” enables the retrieval of related data from other data sources that have also been registered to this Category. In this way, scientists can retrieve related information, such as climate information, for their findings from the excavations.