Elsevier

Data & Knowledge Engineering

Volume 42, Issue 3, September 2002, Pages 273-291

An integrated, dual learner for grammars and ontologies

https://doi.org/10.1016/S0169-023X(02)00046-0

Abstract

We introduce a dual-use methodology for automating the maintenance and growth of two types of knowledge sources that are crucial for natural language text understanding: background knowledge of the underlying domain, and linguistic knowledge about the lexicon and the grammar of the underlying natural language. A particularity of this approach is that learning occurs simultaneously with the ongoing text understanding process. The knowledge assimilation process is centered around the linguistic and conceptual 'quality' of various forms of evidence underlying the generation, assessment and ongoing refinement of lexical and concept hypotheses. On the basis of the strength of this evidence, hypotheses are ranked according to qualitative plausibility criteria, and the most reasonable ones are selected for assimilation into the already given lexical class hierarchy and domain ontology.

Introduction

Intelligent systems require knowledge-rich resources to reason with. As their creation and maintenance are usually delegated to human experts who are slow and costly, these systems face the often deplored knowledge acquisition bottleneck. The knowledge supply challenge is even more pressing when various knowledge sources have to be provided within the framework of a single system, all at the same time. This is typically the case for knowledge-intensive natural language processing (NLP) systems, which require simultaneous feeding with a lexical inventory, morphological and syntactic rules or constraints, and semantic as well as conceptual knowledge.

Each of these subsystems embodies an enormous amount of specialized component knowledge of its own. Much emphasis has already been put on providing machine learning support for individual components: morphological [4], lexical [19], [23], syntactic [3], [10], [13], semantic [5], [12], [18] and conceptual knowledge [9], [11], [15], [25]. To date, however, only Cardie [2] has attempted to combine these isolated streams of linguistic knowledge acquisition within a common approach, i.e., to learn the different types of relevant NLP knowledge simultaneously.

We, too, propose such an integrated approach for learning lexical/syntactic and conceptual knowledge. New concepts are acquired and positioned in the concept taxonomy, and the grammatical status of their lexical correlates is learned, by taking two a priori given knowledge sources into account. Domain knowledge provides a concept and role taxonomy which serves as a comparison scale for judging the plausibility of newly derived concept descriptions in the light of that prior knowledge. Grammatical knowledge contains a hierarchy of lexical classes which make increasingly restrictive grammatical constraints available for linking an unknown word to its corresponding word class. Our model makes explicit the kind of qualitative reasoning behind these multi-threaded learning processes [9], [24].
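The core idea of ranking hypotheses by accumulated evidence can be sketched as follows. This is a purely illustrative Python sketch, not the authors' implementation: the class, concept names and evidence weights are invented for exposition.

```python
class HypothesisSpace:
    """Tracks candidate concepts (or word classes) for one unknown item.

    Each piece of linguistic or conceptual evidence credits one
    candidate; ranking the candidates by accumulated credit mimics the
    qualitative plausibility ordering described in the text.
    """

    def __init__(self, candidates):
        self.scores = {c: 0 for c in candidates}

    def add_evidence(self, concept, weight):
        """Credit a candidate hypothesis with one weighted observation."""
        if concept in self.scores:
            self.scores[concept] += weight

    def ranked(self):
        """Candidates ordered by decreasing plausibility."""
        return sorted(self.scores, key=self.scores.get, reverse=True)


# Unknown word with three initially open concept hypotheses.
space = HypothesisSpace(["Printer", "Hardware", "Software"])
space.add_evidence("Printer", 3)   # strong evidence, e.g. an appositive
space.add_evidence("Hardware", 1)  # weak evidence, e.g. a genitive
best = space.ranked()[0]           # -> "Printer"
```

The most plausible hypothesis is then the one selected for assimilation into the taxonomy; in the real system the "weights" are qualitative quality labels rather than numbers.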

In this article, we will informally introduce our approach by discussing a concrete learning scenario from a grammatical as well as conceptual angle in Section 2. We will then describe the model of the knowledge acquisition process and the architecture of the learning system in more depth in Section 3. After that, we return briefly to the learning scenario already introduced in Section 2 by considering additional details in Section 4. Our focus will then shift in Section 5 to a thorough evaluation of our approach. We distinguish an offline learning mode, which captures the quality of the learning results after a text has been completely processed (Section 5.1), and an online mode, in which we assess the quality of the learning results as text understanding proceeds incrementally (Section 5.2). Finally, in Section 6 we discuss the advantages and drawbacks of our approach in the light of current research and our own evaluation results.

Section snippets

A learning scenario

The following informal discussion is intended to illustrate our approach. Consider a learning scenario as depicted in Fig. 1 from a grammatical perspective and in Fig. 2 from a conceptual one. Suppose your domain knowledge tells you nothing about Itoh-Ci-8. Imagine that, one day, your favorite computer magazine features an article starting with “The Itoh-Ci-8 has a size of …”. Has your knowledge increased? If so, what did you learn from just this phrase?

The learning process starts upon the reading

The learning model

The system architecture for eliciting conceptual and grammatical knowledge from texts is summarized in Fig. 3. It depicts how linguistic and conceptual evidence are generated and combined to continuously discriminate and refine the set of word class and concept hypotheses (the unknown item yet to be learned is characterized by the black square).

Grammatical knowledge for syntactic analysis is based on a fully lexicalized dependency grammar [8]. Such a grammar captures binary valency
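A lexicalized dependency grammar of this kind attaches an unknown word only where some known word's valency admits it, which is what narrows the word class hypotheses. The following sketch is an assumption-laden toy, not the grammar formalism of [8]: the lexicon entries, relation names and word classes are invented.

```python
# Toy lexicon: each entry lists the binary valencies (dependency
# relations) the word can govern, mapped to admissible word classes.
LEXICON = {
    "has": {"valencies": {"subject": {"Noun"}, "dirobject": {"Noun"}}},
    "size": {"cls": "Noun"},
}


def can_govern(head, relation, dependent_class):
    """True iff head's entry licenses dependent_class in that relation."""
    entry = LEXICON.get(head, {})
    allowed = entry.get("valencies", {}).get(relation, set())
    return dependent_class in allowed


# "The Itoh-Ci-8 has a size ...": attaching the unknown word as the
# subject of "has" succeeds only under the Noun hypothesis, so the
# competing word class hypotheses for Itoh-Ci-8 can be pruned.
noun_ok = can_govern("has", "subject", "Noun")  # -> True
verb_ok = can_govern("has", "subject", "Verb")  # -> False
```

Each successful or failed attachment of this kind is one piece of grammatical evidence feeding the hypothesis ranking.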

The learning scenario revisited

Depending on the type of syntactic construction in which the unknown lexical item occurs, different hypothesis generation rules may fire. Genitives, such as “The switch of the Itoh-Ci-8 …”, place far fewer constraints on the item to be learned (Itoh-Ci-8 can be anything that has a switch) than, say, appositives like “The laser printer Itoh-Ci-8 …” (Itoh-Ci-8 must be a laser printer). In the following, let target be the unknown item (Itoh-Ci-8) and base be the known item (“switch”). The main
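The contrast between weak genitive and strong appositive constraints can be made concrete with a small sketch. The ontology, role names and rule signatures below are invented for illustration and are not the paper's actual rule set.

```python
# Tiny invented ontology: concept -> parent concept and role fillers.
ONTOLOGY = {
    "LaserPrinter": {"isa": "Printer", "roles": {"has-switch"}},
    "Computer": {"isa": "Hardware", "roles": {"has-switch"}},
}


def genitive_rule(base_role):
    """'The <base> of the <target>': target can be any concept
    bearing the role derived from base (a weak constraint)."""
    return {c for c, d in ONTOLOGY.items() if base_role in d["roles"]}


def appositive_rule(base_concept):
    """'The <base> <target>': target must be base's concept or a
    direct specialization of it (a strong constraint)."""
    return {c for c, d in ONTOLOGY.items()
            if c == base_concept or d["isa"] == base_concept}


# "The switch of the Itoh-Ci-8 ..." leaves two hypotheses open,
# "The laser printer Itoh-Ci-8 ..." narrows them to one.
weak = genitive_rule("has-switch")        # {"LaserPrinter", "Computer"}
strong = appositive_rule("LaserPrinter")  # {"LaserPrinter"}
```

The more restrictive the firing rule, the smaller the surviving hypothesis set, which is exactly why appositives are the higher-quality evidence in the ranking.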

Evaluation

The domain knowledge base on which we performed our evaluation experiments contained approximately 3,000 concepts and relations from the information technology (IT) domain; the grammatical class hierarchy comprised 80 word classes. Both knowledge sources form the backbone of itsyndikate, a natural language understanding system which extracts facts as well as descriptive and evaluative assertions from computer magazine test reports and product announcements [7]. We randomly selected 48 texts from

Discussion and conclusions

Knowledge-based systems provide powerful means for reasoning, but it takes a lot of effort to equip them with the knowledge they need, usually by manual knowledge engineering. In this article, we have introduced an alternative solution. It is based on an automatic learning methodology in which concept and grammatical class hypotheses emerge as a result of the incremental assignment and evaluation of the quality of linguistic and conceptual evidence related to unknown words. No specialized

Dr. Udo Hahn is a professor of natural language processing at the University of Freiburg, Germany. His main research interests include the integration of text understanding and information systems, with a focus on automatic summarization, information extraction, text mining, and information retrieval.

References (28)

  • U. Hahn et al., Let's parsetalk: message-passing protocols for object-oriented parsing
  • U. Hahn et al., A text understander that learns
  • M. Haruno et al., Using decision trees to construct a practical parser, Machine Learning (1999)
  • P.M. Hastings et al., The ups and downs of lexical acquisition


Kornél Markó, MA, is a research assistant in the Computational Linguistics Research Group at the University of Freiburg, Germany. He currently works on the methodological integration of grammar and concept learning.
