1 Introduction

Nowadays there is a trend toward providing natural language interfaces (NL interfaces, NLI) for human-machine communication. According to Gartner’s hype cycle report for emerging technologies [11], in 2017 virtual assistants with NL interfaces, such as Siri (iOS), Google Assistant and Yandex.Alisa (Android), and Cortana (Windows), were at the peak of the cycle. Advances in this area (such as NL understanding) now have a major impact on the corporate environment. Because employees still spend a lot of time formulating queries for retrieving data (see, for example, [6]), it is beneficial for companies to use high-level tools that simplify this process by means of NLIDBs. An NLI lowers the entry threshold for newcomers and makes it possible to formulate ad hoc queries.

NLIDBs can be classified as domain-dependent and domain-independent. Domain-dependent systems usually demonstrate better results but are tightly bound to the specifics of their domain. In [3] we presented our domain-independent system Reply for transforming traditional information systems (IS) into intelligent ones with NL querying. By means of Reply, end users are provided with a new NLI to their existing IS without source code modification. It is also possible to start developing a new IS with an NLI as a ready-made component for its database. To provide an NLI to a relational database (RDB), we use an ontology-based data access (OBDA) approach that automatically transforms the input NL query into a SPARQL query and then into an SQL query, taking into account the specifics of the concrete RDBMS, the database schema, and its content.

Towards creating a platform that can be used as a self-service solution, we have taken a number of preexisting stand-alone third-party modules and complemented them with our own modules to build a complete solution. Although many NLIDBs have been developed since the 1960s, there are only a few reports about real-world installations (e.g., [16]). Bearing in mind that “NLIDBs will not be widely used in businesses until their configuration is so easy that any computer professional could be able to perform it” [12], we have developed a number of tools that automate the configuration throughout the whole pipeline. This is what favorably distinguishes our solution from others.

While bringing our system into a real-world enterprise infrastructure, we encountered the problem that neither our own nor any third-party transformation module used in our pipeline took the user’s access privileges for the data sources into account. For example, the SPARQL to SQL module expects that the configuration maps elements of the ontology to elements of the database that are actually available to the user. Direct module integration led to permission violation exceptions at the execution stage, accompanied by error messages that are incomprehensible to the end user.

To the best of our knowledge, there is no published approach for automated handling of data access policies in ontology-based NLIDBs that use black-box NL to SPARQL and SPARQL to SQL modules unaware of policy restrictions.

The main contributions of our research presented in the paper are the following:

  1. an adaptive ontology-driven approach that reinforces the practical value of NLIDBs by taking into account restricted access to data sources;
  2. an algorithm for handling access policies in NLIDBs based on OBDA solutions, where SPARQL is used as an intermediate representation language;
  3. a demonstration of the proposed approach with an example.

The paper is divided into six sections. Section 2 contains the description of the suggested approach. Section 3 includes a demo example. Section 4 presents a preliminary evaluation of the approach. Section 5 describes the related work. Section 6 contains a conclusion and future perspectives.

2 NLIDB Obstacles and OBDA Solutions

In this paper, we focus on access policies for two types of database objects: tables and columns. Access privileges for rows (row-level security) are usually implemented with the help of data views: the user cannot violate row access privileges, and if a data row should not be returned to the user according to the access control requirements, it has already been filtered out of the result. For cell-level privileges it is common to mask the inaccessible value and to inform the user that the value was masked. Before discussing the NLIDB obstacles, let us say a few words about data access control in traditional information system UIs.
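
For instance, a row-level policy of the kind described above could be implemented with a view along the lines of the following sketch (all table, column, and role names here are illustrative assumptions, not taken from the demo schema used later):

    -- Rows of other managers are filtered out before the user ever sees them
    CREATE VIEW visible_agreements AS
      SELECT id, start_date, end_date
      FROM agreement
      WHERE responsible_manager = current_user;

    GRANT SELECT ON visible_agreements TO app_user;
    REVOKE SELECT ON agreement FROM app_user;  -- force access through the view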

For a traditional information system UI it is easy to report the current access restrictions. If some UI controls lead to the execution of queries that cannot be executed, they can be disabled or hidden. Queries that cannot be executed are those that access restricted tables (for table-level restrictions) or restricted columns in table references and search conditions (for column-level restrictions). Asterisks in select lists are explicitly expanded to the available columns. In the case of a whole-database search (e.g., by substring), the search should be performed only over the available tables and columns.

The user interface for ad hoc queries usually has controls for specifying the blocks of the resulting SQL query, e.g., controls for choosing output fields, conditions, sorting order, etc. At the specification stage, all restricted tables and columns should be disabled or hidden. If the UI suggests possible values for columns affected by row- or cell-level security, the suggestions should contain only visible values. At the building stage, where the resulting SQL query is formulated, the building engine should connect the required tables only through paths and join conditions that avoid restricted objects. If this is not possible, advanced ad hoc query UIs suggest which fields or tables should be excluded to make query building succeed.

For an NLIDB it is usually difficult to inform the user what information is available via the NLI. The easiest way is to describe the available information in the documentation; more advanced user interfaces offer autocompletion tools. Another problem is the lack of adequate reporting when a query fails, which can happen for several reasons. In this paper, we focus on the case when the query has been successfully interpreted but its execution has failed due to a permission violation.

This problem can be solved in two main ways:

  1. interpreting the query without considering permissions, executing the query, catching the exception, interpreting the exception, and reporting the reason why this exception has happened;
  2. considering permissions at the interpretation stage and trying to avoid permission violations.

The challenge of the first way is that the exception could have happened for a number of reasons, so it is difficult to take all the related issues into account. That is why we chose the second way.

Considering permissions at the interpretation stage is specific to the NLIDB type. NLIDBs may be classified by architecture into four groups [1]:

  1. pattern-matching systems;
  2. syntax-based systems;
  3. semantic grammar systems;
  4. intermediate representation language systems.

In an intermediate representation language system, one could use SPARQL: the input NL query is transformed into a SPARQL query, which is later transformed into an SQL query by an OBDA framework. Reply has this type of architecture, with SPARQL as the intermediate representation language.

OBDA systems built on top of RDBs can be implemented in two different ways: materialized and virtual. In the materialized approach, all the data from the RDB is transformed into an ontology representation and stored in an RDF triplestore. In the virtual approach, the data is stored only in the RDB, and querying the virtual RDF graph using SPARQL leads to the execution of the corresponding SQL query over the original source, thus avoiding the cost of materialization [2].

From the security point of view, another benefit of the virtual approach is that the data is not duplicated in an RDF triplestore. Although access control can be specified at the RDF level [7], few IT specialists know how to specify user access rights not only in the RDBMS but also in an RDF triplestore. With that in mind, many companies would reject such OBDA systems, because the possibility of data leaks due to incorrect configuration may outweigh the potential benefit of applying an intelligent system. If the virtual approach is used, the end user can only access the data that she/he can access through a traditional ad hoc query interface.

In OBDA solutions for NLIDBs there are four components where permissions could be considered:

  1. the OBDA framework;
  2. the NL to SPARQL translator;
  3. the ontology;
  4. the mappings between elements of an RDB schema and elements of an ontology.

In this paper, we treat the OBDA framework as a black box that takes an ontology, mappings, and a SPARQL query as input and produces an SQL query as output. Under this assumption, we show how OBDA solutions for NLIDBs can be enhanced.

The original architecture is presented in Fig. 1. For the sake of simplicity, we do not depict the errors of the NL to SPARQL module on the diagram.

Fig. 1. Original Reply’s execution stage.

An OBDA mapping specification consists of two blocks, Source and Target (in the Ontop mapping notation). Source is an SQL SELECT query that is executed by the RDB to produce the triples defined by the Target template.

Assuming that Target contains only one RDF triple template, the triple has one of the following forms (an illustrative mapping is sketched after the list):

  1. <instance_uri>  a  <class_uri>
  2. <instance1_uri>  <object_property_uri>  <instance2_uri>
  3. <instance_uri>  <data_property_uri>  <value>
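
A minimal sketch of such a mapping in Ontop’s native notation, for a data-property triple of the third form (the table, column, and property names are illustrative assumptions, not taken from our configuration):

    mappingId   MAP-person-ssn
    target      :person/{id} :ssn {ssn} .
    source      SELECT id, ssn FROM person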

The SQL SELECT query in the Source block has the following structure:
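
Schematically, using the terms referenced in the next paragraph, the structure can be sketched as:

    SELECT select_list
    FROM from_item [JOIN from_item ON join_condition ...]
    WHERE conditions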

From the permission point of view, the select_list and conditions of the SQL SELECT query may use fields that are forbidden. In addition, the from_item can refer to forbidden tables or fields. Although an SQL query looks different from an NL query, their building blocks are the same: an NL query can vary output fields, requested entity types, and conditions. As in SQL, output fields can be specified implicitly.

The easiest way to prevent execution failure of the translated query is to remove the elements of the ontology that are mapped to the forbidden elements of the RDB. In this case, however, queries that refer to the removed elements of the ontology could be misinterpreted. Firstly, the closest element in the edited ontology may not be related to the concept or relation the user has asked for. Secondly, if no close concept or relation can be found, the related part of the query is ignored completely, and the NLIDB does not have enough data to inform the user about the situation and give recommendations on query reformulation. Worse, in this case the user may conclude that the system cannot handle complex queries. Finally, the system would not know why the user was not satisfied with the query result and could not give any recommendations on query reformulation.

To tackle this problem, we suggest the following approach. For each user (or for each role, if multiple users have the same privileges), a new instance of the ontology and a new set of mappings are created automatically by our personalization module. The individual user’s instance of the ontology is created by copying the common ontology and annotating the elements whose mappings have been removed and for which no other mappings are left. The individual user’s set of mappings is created by removing from the common mapping specification only those mappings whose Sources require access to restricted tables or fields. This process is presented in Fig. 2.

Fig. 2. Creating individual user’s instance of ontology and set of mappings.

To provide the best user experience, NL to SPARQL translators should handle permission-based restrictions. Unfortunately, none of the tools that we have analyzed [4, 15, 18] supports this. Here we suggest how existing NL to SPARQL translators can be minimally modified to support permission-based restrictions.

One of the main requirements is that each output data property that was not explicitly specified in the user’s query should be wrapped into an OPTIONAL block of the resulting SPARQL query. If there are no mappings for a data property wrapped into an OPTIONAL block [17], the OBDA framework returns a corresponding SQL query in which the unmapped column has NULL values.
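
A minimal sketch of this convention, assuming a hypothetical :Person class with a mapped :lastName property and an unmapped :ssn property (prefix declarations omitted):

    SELECT ?p ?lastName ?ssn WHERE {
      ?p a :Person .
      OPTIONAL { ?p :lastName ?lastName }   # mapped: translated into a real column
      OPTIONAL { ?p :ssn ?ssn }             # no mapping: the column is NULL in the generated SQL
    }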

To make Reply more adaptable to user preferences, default output attributes of the queried concepts are supported. E.g., if a user types the query “list clients”, a predefined subset of fields is returned per client. The system supports application-wide defaults, and the user can also choose the preferred attributes. This is implemented by excluding the OPTIONAL blocks whose output data properties are not in the corresponding preferred list. Alternative connections between the related classes via different object properties can be specified by means of UNION.
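
A sketch of such an alternative connection, assuming hypothetical :hasRentalAgreement and :hasServiceAgreement object properties that both link a person to an apartment:

    SELECT ?p ?apt WHERE {
      ?p a :Person .
      { ?p :hasRentalAgreement ?agr . ?agr :forApartment ?apt }
      UNION
      { ?p :hasServiceAgreement ?agr . ?agr :forApartment ?apt }
    }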

Fig. 3. Modified Reply’s execution stage. (Color figure online)

Algorithm 1 describes the steps of our approach in case of a permission change. Algorithm 2 is used to modify the intermediate SPARQL query generated for the user’s NL query. The complete transformation process is presented in Fig. 3; the new modules are marked in blue.

Algorithm 1. Handling a permission change.
Algorithm 2. Modifying the intermediate SPARQL query.

3 Example

To demonstrate the suggested approach, we present an example based on our demo database about apartment agreements. Each house contains a number of apartments. Some of the apartments are rented, and some have been sold. For each rented apartment there is a rental agreement, and for each sold apartment there is a service agreement.

A person who has a running rental agreement is called a tenant, and a person who owns an apartment (i.e., has a running service agreement, from the landlord’s point of view) is called an owner. A fragment of the database schema is presented in Fig. 4.

Fig. 4. Fragment of database schema about apartment agreements.

The restrictions on this database are the following: a technical support specialist has no access to the service_agreement table (and therefore cannot learn which people in the list are owners) and has no access to any person’s social security number. If the database has row-level access policies, the forbidden rows are hidden by the database management system itself, so no special treatment is needed at the OBDA layer. If there is a cell-level access policy, the restricted cells are marked as hidden (either by NULL or by a predefined value) and also need no special treatment.
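
Such a policy could be expressed, for example, with standard SQL privileges as in the sketch below (the role name tech_support and the exact set of non-restricted columns are assumptions for illustration):

    -- Table-level restriction: no access to service agreements
    REVOKE ALL PRIVILEGES ON service_agreement FROM tech_support;

    -- Column-level restriction: every column of person except SSN
    GRANT SELECT (id, first_name, last_name, phone) ON person TO tech_support;

    -- The remaining tables stay fully readable
    GRANT SELECT ON house, apartment, rental_agreement TO tech_support;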

As suggested above, the individual user’s set of mappings was created by removing from the source mapping specification only those mappings whose Sources require access to restricted tables or fields.

Figure 5 presents the result of a comparison between the source mapping rules file and the corresponding individual user’s file. The latter lacks the mappings related to the service_agreement table and the SSN column.

Fig. 5. Comparison between source and individual mapping rule files.

For the NL query “Return all information about the tenants”, a traditional NL to SPARQL engine would detect that a person has an id, first_name, last_name, phone, and SSN and would formulate a SPARQL query returning all these columns. Running this query would lead to an access denied exception due to the SSN restriction. To solve this problem, one could either supply the NL to SPARQL engine with information about the restricted fields or post-process the generated query. The easiest solution is to put the implicitly expected fields into OPTIONAL blocks: because the mappings for the forbidden fields have been removed, the OBDA framework simply skips them without any exception.
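
A sketch of what such a query might look like after this treatment, assuming an ontology where tenants are people with a rental agreement (class and property names are illustrative):

    SELECT * WHERE {
      ?p a :Person ;
         :hasRentalAgreement ?ra .          # tenant: a person with a rental agreement
      OPTIONAL { ?p :firstName ?firstName }
      OPTIONAL { ?p :lastName ?lastName }
      OPTIONAL { ?p :phone ?phone }
      OPTIONAL { ?p :ssn ?ssn }             # mapping removed for this user: silently skipped
    }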

The same solution works for the cases when all the output fields are available and explicitly specified (e.g., the NL query “Return first and last name of tenants”). For the NL query “Return first name, last name and SSN of tenants”, the OBDA framework would fail to generate the SQL query because of the missing mappings for SSN. To report to the user that the field SSN was detected in the query but omitted because of the policy restrictions, our system parses the generated SPARQL query to find the data properties for which there are no mappings. This approach is error-prone, which is why we suggest that NL to SPARQL engines explicitly output all detected classes, object properties, and data properties, as well as the role they play in the query.

All constraints retrieved from the query should be specified in the SPARQL query as usual. If the user constrains an available field, the query is executed successfully (e.g., “Return all information about tenants with last name Doe”). If the user constrains a restricted field (“Return all information about a person with SSN 078-05-1120”), the OBDA framework fails to generate the corresponding SQL query due to the missing mappings for the SSN attribute. One option is to remove the constraint automatically and answer the query without it, but we believe that most of the time this is not the result the user expects, and it is more appropriate to explain to the user why the query cannot be executed.
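
In SPARQL terms, the difference is that the constrained property appears as a mandatory pattern rather than an OPTIONAL one (again with illustrative names):

    SELECT * WHERE {
      ?p a :Person ;
         :ssn "078-05-1120" .               # mandatory pattern over an unmapped property: SQL generation fails
      OPTIONAL { ?p :firstName ?firstName }
      OPTIONAL { ?p :lastName ?lastName }
    }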

Although there is no table named tenant in the database, we specify tenants as people who have a running rental agreement. The same approach applies to the owners. A user with the permission policy specified above does not have access to service agreements, so she/he is not able to get the expected response to the query “Return all information about service agreements”.

For the NL query “Return all information about people at 1650 El Prado”, the system establishes that the only object property available between a person and an apartment (which is part of the house at 1650 El Prado) is having a rental agreement, and therefore returns data about tenants only. If the user explicitly asks about the owners (“Return all information about owners at 1650 El Prado”), the system demonstrates its intelligent capabilities and suggests reformulating the query to ask about people (the superclass) or tenants (the sibling class).

An example of the SPARQL query that would be generated for the NL query “Return all information about people at 1650 El Prado” is sketched below (class and property names are illustrative; the actual generated query may differ):

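    SELECT * WHERE {
      ?p a :Person .
      {
        ?p :hasServiceAgreement ?agr .      # first UNION block: its mappings have been removed
        ?agr :forApartment ?apt .
      }
      UNION
      {
        ?p :hasRentalAgreement ?agr .
        ?agr :forApartment ?apt .
      }
      ?apt :inHouse ?h .
      ?h :address "1650 El Prado" .
      OPTIONAL { ?p :firstName ?firstName }
      OPTIONAL { ?p :lastName ?lastName }
      OPTIONAL { ?p :phone ?phone }
      OPTIONAL { ?p :ssn ?ssn }             # no mapping left for SSN
    }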

The first block of the UNION clause and the SSN OPTIONAL clause would be omitted by the OBDA framework because the corresponding mappings have been removed according to our approach.

4 Implementation and Testing Specifics

The suggested approach was implemented in our system Reply, which transforms traditional information systems into intelligent ones with a natural language query interface. In the previous version of Reply, the administrator was able to create multiple independent versions of intelligent systems by specifying multiple pairs of ontology and mappings. In the current version, in addition to that, a base pair of ontology and mappings can be specified and a permission tracking option can be enabled. In that case, our system automatically retrieves access policies from the RDB metadata and generates multiple configurations for the selected users according to the suggested approach. This process can be rerun from the system administrator’s GUI, by an API call, or automatically by our system. Automatic permission tracking is available for a subset of the supported databases.

For example, to automatically track changes in user permissions in PostgreSQL, we generate code that creates an event trigger fired on GRANT or REVOKE DDL commands. The generated code must be executed by a database superuser. In the body of the trigger, we raise a notification on a channel that our system listens to. PostgreSQL’s metadata about user permissions is retrieved from the information_schema.column_privileges view.
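
A minimal sketch of such generated code (the channel name is an assumption; the actual generated code may differ):

    -- Notify the listening Reply instance whenever privileges change
    CREATE OR REPLACE FUNCTION notify_privilege_change() RETURNS event_trigger AS $$
    BEGIN
      PERFORM pg_notify('reply_permission_change', tg_tag);  -- channel name is illustrative
    END;
    $$ LANGUAGE plpgsql;

    CREATE EVENT TRIGGER track_grant_revoke
      ON ddl_command_end
      WHEN TAG IN ('GRANT', 'REVOKE')
      EXECUTE PROCEDURE notify_privilege_change();

    -- Column-level privileges are then re-read, e.g.:
    SELECT grantee, table_name, column_name, privilege_type
    FROM information_schema.column_privileges;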

To parse the SQL queries in the Sources of the mappings we use JSqlParser, and to manipulate the generated SPARQL query we use Apache Jena.

Because we have not found an NL to SPARQL translator that outputs implicitly requested attributes in a way consistent with our approach, testing the whole pipeline starting from an NL query would not provide more information than starting from a SPARQL query. As stated above, our approach is independent not only of the language used by the user but also of the specifics of the query and of the implementations of the NL to SPARQL and SPARQL to SQL translators.

To test our system, we use the database from the example above (see Fig. 4). The database schema contains one-to-many and many-to-many relations, as well as multiple paths between tables. We generated SPARQL queries that use combinations of available and forbidden tables and columns. Our system successfully reported permission violations and suggested which elements of the NL query should be avoided (even though there was no actual NL query in these tests). In a real-world setting, the reported elements might not be explicitly included in the user’s NL query, so to make the report even clearer, tighter integration with the NL to SPARQL translator is needed. We have also tested that our system reacts correctly to permission updates.

5 Related Work

Modern corporate NLIDBs have a number of additional requirements that are not usually considered by academia:

  1. taking into account the user’s access privileges for the data sources;
  2. providing out-of-the-box integration with corporate chat platforms;
  3. reusing existing corporate knowledge resources to automate the configuration stage.

At present, there are no platforms that provide a comprehensive OBDA solution for NLIDBs and at the same time cover all of the pipeline stages. Nevertheless, there are a number of tools that implement individual stages of the pipeline for transforming an NL query into an SQL query via an intermediate SPARQL representation [2, 4, 15, 18]. Almost all of them expect that either all instances in the ontology or the whole database are available. This is not the case in a corporate environment, where users usually have only limited access to data. The papers [10, 12] present a sufficiently complete overview of the NLIDB state of the art, including the benefits and limitations of different solutions.

Another key factor for a successful installation of an NLIDB in a corporate environment is tight integration with existing NL tools. One of the most widely used types is corporate chat platforms that support chat bots. Among other things, chat bots can report the status of running systems, run tests, and even deploy code to the production environment. From the earliest versions, Reply has had an API that can be called from the user’s NL query interface to retrieve the query result. While bringing Reply into real-world installations, we have developed out-of-the-box wrappers for the most popular corporate chat platforms. Reply can be integrated into third-party web-based information systems as a separate web page. Since we provide an HTTP REST API, Reply can also be used by desktop applications.

Previously, to automate the configuration of Reply, we successfully implemented a number of our own [14] and others’ [5] state-of-the-art techniques. E.g., we automatically convert an RDB schema into an ontology representation and enrich it with synonyms and with parent and child concepts using our own or third-party ontological and linguistic resources [8, 13]. In a corporate environment, we can also use corporate knowledge resources and document storages as a source for ontology learning. Nowadays, many knowledge resources use a wiki or wiki-based markup language. The approach we are developing for constructing ontologies from such knowledge bases is similar to the approach used for building DBpedia [9].

To improve the comprehensiveness of NLIDBs and to enable users to obtain practical results through the NL interface, we suggest new intelligent tools for adapting the NL interface to corporate environments on self-service principles.

6 Conclusion

In this paper, we presented a new approach to the intelligent handling of database access policies within OBDA. It helps to tackle the problems of adapting NLIDBs to corporate environments. Our approach does not depend on the methods used to implement the NL to SPARQL and SPARQL to SQL transformations, so it is applicable to a wide range of OBDA solutions for NLIDBs.

According to the algorithm, the system automatically modifies the generated SPARQL query, taking into account the access policies of the target DBMS and the user’s preferences. The algorithm is demonstrated on the apartment agreements database example.

In the future, we plan to apply the developed tools and services not only for querying RDBMSs using natural language but also for inserting data into them. We believe that our intelligent tools can improve human-computer interaction in traditional corporate environments as well as in other environments, such as the human-centric Internet of Things.