C-Phrase: A system for building robust natural language interfaces to databases

https://doi.org/10.1016/j.datak.2009.10.007

Abstract

This article presents C-Phrase, a natural language interface system that can be configured by normal, non-specialized, web-based technical teams. C-Phrase models queries in an extended version of Codd’s tuple calculus and uses synchronous context-free grammars with lambda-expressions to represent semantic grammars. Given an arbitrary relational database, authors rapidly build an NLI using what we term the name-tailor-define protocol. We present a small study demonstrating the effectiveness of this approach for the GEO corpus, and we introduce the evaluation metric of willingness, which complements the standard metrics of precision and recall. However, our true evaluation comes as we open-source C-Phrase.

Introduction

Although the advantages of interacting with computers in natural language (e.g., Swedish or English) are easy to recount (e.g., humans already know natural language, natural language is capable of expressing nuanced semantics, speech interfaces are eyes-free/hands-free, etc.), in practice such systems have enjoyed only limited success [3], [6]. Typically such systems are brittle and unpredictable [21]: slight rephrasings of successful commands result in failures, users have difficulty comprehending system coverage, etc. Although NLIs to databases have fallen out of focus, many developments over the last 20 years make another pass at the general problem promising (e.g., maturation of relational technology and theory [1], availability of extensive lexical resources [13], high performance theorem provers [25], advances in machine learning, improvements in speech recognition [10], etc.).

One factor that has blocked the uptake of natural language interfaces (NLIs) to databases has been the economics of configuring such systems [3], [6]. Typically configuration requires high levels of linguistic expertise and long time commitments. The typical work environment has neither of these in great supply and thus when presented with the choice of building a common forms-based interface versus an NLI, organizations typically opt for forms. Our work seeks to make NLIs a more attractive option by reducing the time and expertise requirements necessary to build them.

Given the limited linguistic expertise possessed by most technical teams, modern approaches to configuring NLIs to standard relational databases come down to one of three approaches:

  1. Let authors only lightly name database elements (e.g., relations, attributes, join paths, etc.) and reduce query interpretation to graph match [4], [20].

  2. Offer a GUI-based tool where an author builds a semantic grammar that interprets natural language queries over databases.

  3. Use machine learning to induce a semantic grammar from a corpus of natural language query/correct logical interpretation pairs [24], [11], [7], [26], [27].

Since our ultimate goal is the delivery of a high impact NLI system, it should not be surprising that we primarily adopt an authoring approach. That said, aspects of the first approach are deeply integrated into our work and we have laid the foundation for an integration of machine learning techniques to achieve highly robust NLIs ‘in the limit’ after experiencing large volumes of user queries.

There are three main assumptions that underlie C-Phrase. The first is that standard, off-the-shelf relational databases are sufficiently expressive to back NLIs. The second is that any approach must include a generation component that can accurately paraphrase database queries back to the user in natural language. Such a capability lets users resolve ambiguous analyses and gives them a feeling of control, reducing the anxiety of being misunderstood by the system. The third assumption is that while users can restrict their inquiries to the closed domain of a given database, they will issue queries and commands that are highly elliptical and idiosyncratic, contain spelling (or speech recognition) errors, and often lack proper syntactic structure (e.g., “energy stocks down last year up 10% this year”). Thus analysis components must be robust, seeking out near misses when input is less than ideal, and the system must be adaptable, making it easy for authors to patch running systems to catch unanticipated phrasings.

This article looks at the total problem of configuring and evaluating NLIs to databases from the system author’s perspective. Although the primary effort goes into configuring the parser and generator to associate phrases with database elements, there are other tricky issues, such as spell checking and, more generally, the leveraging of additional linguistic resources for enhanced robustness.

This article is an extension of an earlier conference paper [17] and has a similar structure, but presents the material in greater depth. Section 2 lays the foundation of concepts necessary to understand our approach. This includes a formalization of naming database elements, the queries of the sort we handle, and finally some basic material on synchronous context-free grammars enriched with lambda-expressions. Section 3 presents our formal approach to analyzing noun phrases, which we consider to be the primary challenge of NLIs to databases. The treatment is based on a limited derivative of X-bar theory [9] encoded in synchronous context-free grammars [2] augmented with lambda calculus expressions (λ-SCFG). Section 4 presents our GUI-based authoring tool, through which an author populates the system with the formal elements described in Sections 2 and 3. Section 5, which is new in this article, provides a detailed description of the processing steps that C-Phrase goes through to respond to a user’s typed input. Section 6 describes the overall methods of evaluating NLIs to databases and presents a small study confirming that reasonably skilled subjects can effectively use C-Phrase’s authoring tool. Section 6 also introduces a new evaluation metric termed willingness that complements the classic notions of precision and recall. Section 7 compares our approach with other approaches to building NLIs to databases and Section 8 concludes.

Section snippets

Foundations

This work rests on relational databases and assumes that authors (and readers) have a knowledge of primary and foreign keys and Codd’s tuple calculus. There are many excellent sources covering these concepts, ranging from detailed theoretical treatments [1] to more conceptual undergraduate textbooks. Perhaps due to substantial growth within the database field, many recent textbooks have skipped Codd’s tuple calculus. We consider this to be unfortunate with respect to NLIs to databases. For SQL

Focusing on noun phrases

Through earlier and ongoing work [14], [16], we posit that the main analysis task for natural language interfaces to databases is the analysis of noun phrases, often with modifying adjectival phrases, prepositional phrases, relative clauses, and modifying gerund constructions. In fact more experienced users often type just noun phrases to describe the information they would like retrieved. In addition our experience tells us that we can model noun phrases as coordinated pre-modifiers and

Our GUI-based authoring tool

We have developed an AJAX-based authoring tool that gives an integrated GUI through which an author may import a schema from any ODBC accessible database and commence what we term the name-tailor-define cycle of authoring an NLI.

Naming actions provide simple text names for the relations, attributes and join paths of the database schema. In short one populates the naming relation (N) of Section 2.1. Fig. 3 shows the schema browser in our tool where the user has the option of exploring and naming

Processing

Now that we have described the main representations underlying C-Phrase, we shall describe the processing steps taken when a user inputs a string. Fig. 7 shows the flow of processing. This figure shows a cascade of operations that transform the user’s input string, ultimately into a system response, report of failure or a request for the user to resolve ambiguity.
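The three possible outcomes of this cascade (a response, a report of failure, or a request to resolve ambiguity) can be illustrated with a minimal Python sketch. It is not C-Phrase's actual pipeline: the stage functions (`toy_normalize`, `toy_interpret`, `toy_paraphrase`), the lexicon, and the SQL strings are all hypothetical stand-ins chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Response:
    paraphrase: str  # natural-language paraphrase of the chosen query
    query: str       # query that would be sent to the database


@dataclass
class Failure:
    reason: str


@dataclass
class AmbiguityRequest:
    paraphrases: List[str]  # one paraphrase per competing interpretation


Outcome = Union[Response, Failure, AmbiguityRequest]


def process(text: str,
            normalize: Callable[[str], str],
            interpret: Callable[[str], List[str]],
            paraphrase: Callable[[str], str]) -> Outcome:
    """Run the input through each stage and classify the result."""
    cleaned = normalize(text)
    queries = interpret(cleaned)
    if not queries:
        return Failure(f"no interpretation found for {cleaned!r}")
    if len(queries) > 1:
        # Let the user choose among paraphrased alternatives.
        return AmbiguityRequest([paraphrase(q) for q in queries])
    return Response(paraphrase(queries[0]), queries[0])


# Hypothetical toy stages (assumptions, not part of C-Phrase).
TOY_LEXICON: Dict[str, List[str]] = {
    "cities": ["SELECT name FROM city"],
    "capitals": ["SELECT capital FROM state",
                 "SELECT capital FROM country"],
}


def toy_normalize(text: str) -> str:
    return " ".join(text.lower().split())


def toy_interpret(text: str) -> List[str]:
    return TOY_LEXICON.get(text, [])


def toy_paraphrase(query: str) -> str:
    return f"show the result of: {query}"


def run(text: str) -> Outcome:
    return process(text, toy_normalize, toy_interpret, toy_paraphrase)
```

Here `run("  Cities ")` yields a `Response`, `run("capitals")` yields an `AmbiguityRequest` with two paraphrases, and `run("mountains")` yields a `Failure`. Keeping the stages as injected functions is one way to let an author patch a single stage of a running system without touching the others.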

Benchmark studies

Benchmark studies rely on a human-built ‘gold standard’ of natural language sentence/logical query pairs, possibly obtained through wizard-of-Oz studies. The advantage of such benchmarks is that they set a bar on the expressivity and types of queries that a system must be able to handle in a given domain. For example, systems incapable of handling negation or ambiguity (for example, [20]) would be flagged as inadequate by such benchmarks.
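As a rough illustration of scoring a system against such a gold standard, the sketch below computes precision and recall under definitions commonly used in this literature: precision is the fraction of answered sentences whose query is correct, and recall is the fraction of all gold sentences answered correctly. These definitions, the function name `precision_recall`, and the sample data are assumptions for illustration; the article's own definitions are not shown in this excerpt. Comparing queries by string equality is a further simplification.

```python
from typing import Dict, Optional, Tuple


def precision_recall(gold: Dict[str, str],
                     system: Dict[str, Optional[str]]) -> Tuple[float, float]:
    """Score system output against gold sentence/query pairs.

    gold maps each sentence to its correct logical query.
    system maps a sentence to the query it produced, or None
    (or no entry at all) if it declined to answer.
    """
    answered = [s for s in gold if system.get(s) is not None]
    # Simplification: exact string match stands in for query equivalence.
    correct = [s for s in answered if system[s] == gold[s]]
    precision = len(correct) / len(answered) if answered else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample data: four gold pairs, three answered, two correct.
gold = {
    "cities in texas": "Q1",
    "rivers through ohio": "Q2",
    "largest state": "Q3",
    "states without rivers": "Q4",
}
system = {
    "cities in texas": "Q1",
    "rivers through ohio": "Q2",
    "largest state": "WRONG",
    "states without rivers": None,
}

p, r = precision_recall(gold, system)
```

With this sample data, precision is 2/3 and recall is 1/2. A system that declines to answer when unsure raises its precision at the cost of recall, which is why the two measures are reported together.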

A widely used benchmark for NLIs to databases is the

Related work

Due to the public availability of the GEOQUERY Corpus, we can compare our initial results to several machine learning approaches [24], [11], [7], [26], [27], an approach based on light annotation [20] and an authoring approach over Microsoft’s EnglishQuery product (described in [20]).

Fig. 9, adapted from [26], displays the precision and recall measures for all the machine learning based systems [24], [11], [7], [26], [27] over the GEOQUERY Corpus. We have added results for our authoring study (for

Conclusions

This article has presented C-Phrase, a state-of-the-art system for natural language interfaces to relational databases. Internally the system uses X-bar theory [9] inspired semantic grammars encoded in λ-SCFG to map user requests to an extended variant of Codd’s tuple calculus which in turn is automatically mapped to SQL. The NLI author builds the semantic grammar through a series of naming, tailoring and defining operations within a web-based GUI. The author is shielded from the formal

Acknowledgements

I would like to acknowledge the hard work of Peter Olofsson and Alexander Näslund for building the AJAX authoring interface. I would like to also acknowledge Philipp Cimiano and Myra Spiliopoulou for a very productive discussion on precision and recall measures for NLIs to databases. Additionally thanks are due to Bart Massey for convincing me to open source C-Phrase on Google Code and for in fact helping me to do so (http://code.google.com/p/c-phrase/). Finally I would like to acknowledge

Michael Minock earned a B.S. in Computer Science from the University of Michigan Honors College in 1991 and a Ph.D. in Computer Science from UCLA in 1997. Prior to his position as a senior lecturer at the Department of Computing Science, Umeå University, Michael worked as a member of the technical staff at Microelectronics and Computer Technology Corporation (MCC) in Austin, Texas. In addition to NLIs to databases, he is interested in the application of probabilistic reasoning, higher-order logic and machine learning to problems in computational linguistics and data and knowledge representation more generally.

References (27)

  • B. Grosz et al., Team: an experiment in the design of transportable natural-language interfaces, AI (1987)
  • N. Stratica et al., Using semantic templates for a natural language interface to the Cindi virtual library, Data and Knowledge Engineering Journal (2005)
  • S. Abiteboul et al., Foundations of Database Systems (1995)
  • A. Aho et al. (1972)
  • I. Androutsopoulos et al., Database interfaces
  • W. Chu, F. Meng, Database query formation from natural language using semantic modeling and statistical keyword meaning...
  • P. Cimiano et al., Porting natural language interfaces between domains: an experimental user study with the Orakel system
  • A. Copestake et al., Natural language interfaces to databases, The Natural Language Review (1990)
  • R. Ge, R. Mooney, A statistical semantic parser that integrates syntax and semantics, in: Proceedings of the Ninth...
  • R. Jackendoff, X-bar-syntax: a study of phrase structure (1977)
  • D. Jurafsky et al., Speech and Language Processing (2000)
  • R. Kate, R. Mooney, Using string-kernels for learning semantic parsers, in: Proceedings of COLING/ACL-2006, 2006, pp....
  • O. Lemon, X. Liu, DUDE: a dialogue and understanding development environment, mapping business process models to...