An integrated, dual learner for grammars and ontologies
Introduction
Intelligent systems require knowledge-rich resources to reason with. As the creation and maintenance of these resources are usually delegated to human experts, who are slow and costly, such systems face the often-deplored knowledge acquisition bottleneck. The knowledge supply challenge is even more pressing when several knowledge sources must be provided within the framework of a single system, all at the same time. This is typically the case for knowledge-intensive natural language processing (NLP) systems, which must be supplied simultaneously with a lexical inventory, morphological and syntactic rules or constraints, and semantic as well as conceptual knowledge.
Each of these subsystems embodies an enormous amount of specialized component knowledge of its own. Much emphasis has already been put on providing machine learning support for individual components: morphological [4], lexical [19], [23], syntactic [3], [10], [13], semantic [5], [12], [18] and conceptual knowledge [9], [11], [15], [25]. Up to now, however, only Cardie [2] has attempted to combine these isolated streams of linguistic knowledge acquisition within a common approach, i.e., to learn different types of relevant NLP knowledge simultaneously.
We also propose such an integrated approach for learning lexical/syntactic and conceptual knowledge. New concepts are acquired and positioned in the concept taxonomy, and the grammatical status of their lexical correlates is learned, by taking two a priori given knowledge sources into account. Domain knowledge provides a concept and role taxonomy that serves as a comparison scale for judging the plausibility of newly derived concept descriptions in the light of that prior knowledge. Grammatical knowledge contains a hierarchy of lexical classes that makes increasingly restrictive grammatical constraints available for linking an unknown word with its corresponding word class. Our model makes explicit the kind of qualitative reasoning behind these multi-threaded learning processes [9], [24].
In this article, we will informally introduce our approach by discussing a concrete learning scenario from a grammatical as well as conceptual angle in Section 2. We will then describe the model of the knowledge acquisition process and the architecture of the learning system in more depth in Section 3. After that, we return briefly to the learning scenario already introduced in Section 2 by considering additional details in Section 4. Our focus will then shift in Section 5 to a thorough evaluation of our approach. We distinguish an offline learning mode, which captures the quality of the learning results after a text has been completely processed (Section 5.1), and an online mode, in which we assess the quality of the learning results as text understanding proceeds incrementally (Section 5.2). Finally, in Section 6 we discuss the advantages and drawbacks of our approach in the light of current research and our own evaluation results.
Section snippets
A learning scenario
The following informal discussion is intended to illustrate our approach. Consider a learning scenario as depicted in Fig. 1 from a grammatical perspective and in Fig. 2 from a conceptual one. Suppose your domain knowledge tells you nothing about Itoh-Ci-8. Imagine that, one day, your favorite computer magazine features an article starting with “The Itoh-Ci-8 has a size of …”. Has your knowledge increased? If so, what did you learn from just this phrase?
The learning process starts upon the reading
The learning model
The system architecture for eliciting conceptual and grammatical knowledge from texts is summarized in Fig. 3. It depicts how linguistic and conceptual evidence is generated and combined to continuously discriminate and refine the set of word class and concept hypotheses (the unknown item yet to be learned is characterized by the black square).
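This discrimination loop can be pictured as intersecting candidate sets as new evidence arrives. The following is a schematic sketch under that reading, not the system's actual data structures; the word class names and evidence descriptions are invented for illustration:

```python
# Schematic sketch: each piece of linguistic or conceptual evidence maps
# to a set of still-admissible hypotheses; combining evidence amounts to
# intersecting those sets, so the hypothesis space shrinks monotonically.
def refine(hypotheses, evidence_sets):
    """Intersect the current hypothesis set with each evidence set in turn."""
    for admissible in evidence_sets:
        hypotheses = hypotheses & admissible
    return hypotheses

# Initial word class hypotheses for an unknown lexical item.
word_classes = {"Noun", "ProperName", "Verb", "Adjective"}

# Each incoming observation rules out some classes (illustrative only).
evidence = [
    {"Noun", "ProperName"},   # occurs after a determiner
    {"Noun", "ProperName"},   # agrees with a 3rd-person singular verb
    {"ProperName"},           # capitalized in non-sentence-initial position
]

refine(word_classes, evidence)  # → {"ProperName"}
```

The monotone shrinking of the candidate set is what allows the learner to commit to a single hypothesis only once the accumulated evidence warrants it.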
Grammatical knowledge for syntactic analysis is based on a fully lexicalized dependency grammar [8]. Such a grammar captures binary valency
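In a fully lexicalized dependency grammar of this kind, a head word licenses its dependents through valency slots. The following toy check is a hedged sketch of that idea only; the word classes, slot inventory and dependency labels below are illustrative assumptions, not the grammar described in [8]:

```python
# Hypothetical valency table: for each head word class, the word classes
# it may govern together with the dependency label of the licensing slot.
VALENCIES = {
    "noun": [("determiner", "spec"), ("adjective", "mod"), ("noun", "genitive")],
    "verb": [("noun", "subject"), ("noun", "object")],
}

def can_govern(head_class, dependent_class):
    """Return the dependency labels under which `head_class` may govern
    `dependent_class`; an empty list means no valency slot matches."""
    return [label for (cls, label) in VALENCIES.get(head_class, [])
            if cls == dependent_class]

can_govern("noun", "determiner")  # → ["spec"]
can_govern("verb", "adjective")   # → []
```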
The learning scenario revisited
Depending on the type of the syntactic construction in which the unknown lexical item occurs, different hypothesis generation rules may fire. Genitives, such as “The switch of the Itoh-Ci-8…”, place far fewer constraints on the item to be learned (Itoh-Ci-8 can be anything that has a switch) than, say, appositives like “The laser printer Itoh-Ci-8…” (Itoh-Ci-8 must be a laser printer). In the following, let target be the unknown item (Itoh-Ci-8) and base be the known item (“switch”). The main
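The contrast between the weak genitive constraint and the strong appositive constraint can be made concrete with a minimal sketch. The toy taxonomy, concept names and role labels below are invented for illustration and do not reproduce the system's actual knowledge base:

```python
# Toy concept taxonomy: each concept lists its direct superconcepts
# and the roles it introduces (all names are illustrative).
TAXONOMY = {
    "LaserPrinter":   {"isa": ["OutputDevice"], "roles": ["has-size"]},
    "OutputDevice":   {"isa": ["HardwareDevice"], "roles": []},
    "HardwareDevice": {"isa": ["Object"], "roles": ["has-switch"]},
    "Object":         {"isa": [], "roles": []},
}

def has_role(concept, role):
    """True if `concept` carries `role` directly or inherits it."""
    if role in TAXONOMY[concept]["roles"]:
        return True
    return any(has_role(sup, role) for sup in TAXONOMY[concept]["isa"])

def genitive_hypotheses(base_role):
    # "The switch of the Itoh-Ci-8": weak constraint -- the target may be
    # any concept that can fill the role derived from the base item.
    return {c for c in TAXONOMY if has_role(c, base_role)}

def appositive_hypotheses(base_concept):
    # "The laser printer Itoh-Ci-8": strong constraint -- the target must
    # be an instance of the base concept itself.
    return {base_concept}

genitive_hypotheses("has-switch")     # every concept that has a switch
appositive_hypotheses("LaserPrinter") # → {"LaserPrinter"}
```

The genitive rule yields a large candidate set that must be narrowed by further evidence, whereas the appositive rule commits to a single concept immediately.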
Evaluation
The domain knowledge base on which we performed our evaluation experiments contained approximately 3,000 concepts and relations from the information technology (IT) domain; the grammatical class hierarchy was composed of 80 word classes. Both knowledge sources form the backbone of itsyndikate, a natural language understanding system which extracts facts, descriptive and evaluative assertions from computer magazine test reports and product announcements [7]. We randomly selected 48 texts from
Discussion and conclusions
Knowledge-based systems provide powerful means for reasoning, but it takes a lot of effort to equip them with the knowledge they need, usually by manual knowledge engineering. In this article, we have introduced an alternative solution. It is based on an automatic learning methodology in which concept and grammatical class hypotheses emerge as a result of the incremental assignment and evaluation of the quality of linguistic and conceptual evidence related to unknown words. No specialized
Dr. Udo Hahn is a professor of natural language processing at the University of Freiburg, Germany. His main research interests include the integration of text understanding and information systems, with a focus on automatic summarization, information extraction, text mining, and information retrieval.
References (28)
- et al., Content management in the syndikate system: how technical documents are automatically transformed to text knowledge bases, Data and Knowledge Engineering (2000)
- et al., Concurrent object-oriented natural language parsing: the parsetalk model, International Journal of Human-Computer Studies (1994)
- et al., Information extraction and text summarization using linguistic knowledge acquisition, Information Processing and Management (1989)
- An empirical study of automated dictionary construction for information extraction in three domains, Artificial Intelligence (1996)
- et al., The kl-one family, Computers and Mathematics with Applications (1992)
- Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging, Computational Linguistics (1995)
- A case-based approach to knowledge acquisition for domain-specific sentence analysis
- Tree-bank grammars
- Morphological rule induction for terminology acquisition
- et al., Determining prepositional attachment, prepositional meaning, verb meaning and thematic roles, Computational Intelligence (1997)
- Let's parsetalk: message-passing protocols for object-oriented parsing
- A text understander that learns
- Using decision trees to construct a practical parser, Machine Learning
- The ups and downs of lexical acquisition
Kornél Markó, MA, is a research assistant in the Computational Linguistics Research Group at the University of Freiburg, Germany. He currently works on the methodical integration of grammar and concept learning.