
Data & Knowledge Engineering

Volumes 81–82, November–December 2012, Pages 21-45

Inferring the semantic properties of sentences by mining syntactic parse trees

https://doi.org/10.1016/j.datak.2012.07.003

Abstract

We extend the mechanism of logical generalization toward syntactic parse trees and attempt to detect semantic signals that are unobservable at the level of keywords. Generalization of syntactic parse trees, as a measure of syntactic similarity, is defined by the set of maximum common sub-trees obtained, and is performed at the level of paragraphs, sentences, phrases and individual words. We analyze the semantic features of this similarity measure and compare it with the semantics of traditional anti-unification of terms. Nearest-neighbor machine learning is then applied to relate a sentence to a semantic class.

By using a syntactic parse tree-based similarity measure instead of the bag-of-words and keyword frequency approaches, we expect to detect subtle differences between semantic classes that are otherwise unobservable. The proposed approach is evaluated in three distinct domains in which a lack of semantic information makes the classification of sentences rather difficult. We conclude that implicit indications of semantic classes can be extracted from syntactic structures.

Introduction

Proceeding from parsing to the semantic level is an important step toward natural language understanding, with immediate applications in tasks such as information extraction and question answering [1], [10], [30], [45]. Over the last decade, there has been a dramatic shift in computational linguistics from the manual construction of grammars and knowledge bases to partially or totally automating these processes using statistical learning methods trained on large annotated or un-annotated natural language corpora.

In this paper, we explore the possibility of high-level semantic classification of natural language sentences based on syntactic (constituency) parse trees. We address semantic classes appearing in information extraction (IE) and knowledge integration problems that usually require a deep natural-language understanding [6], [8], [12].

We attempt to combine the best of the two worlds of linguistics and machine learning:

  • 1) rely on rich linguistic data such as constituency parse trees, and

  • 2) apply a systematic way to tackle these data, such as graph-oriented deterministic machine learning.

Notice that (1) gives us a rather rich set of features compared to a bag-of-words approach or shallow parsing. Such a rich, inherently structured feature set needs to be tackled by a structured machine learning approach. In this study, we evaluate how this richer set of tree-based features, as the subject of graph-based learning, outperforms keyword-based approaches in a number of text relevance problems.

Our approach is inspired by the notion of anti-unification [26], [29], which is capable of generalizing arbitrary formulas in a formal language. We extend this notion toward the anti-unification of arbitrary linguistic structures, such as constituency parse trees. In this paper, we propose a definition of, and an algorithm for, syntactic generalization that allows us to treat syntactic natural language expressions in a unified way, as if they were logic formulas.
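To make the notion of anti-unification concrete, the following is a minimal sketch of the least general generalization of first-order terms in the spirit of Plotkin [26]. The term encoding (tuples for applications, strings for constants) and the variable names `X0`, `X1`, … are our illustrative choices, not the paper's implementation.

```python
def anti_unify(t1, t2, subst=None, counter=None):
    """Return the least general generalization (lgg) of two terms.

    Terms are tuples ("f", arg1, ...) for applications and plain
    strings for constants."""
    if subst is None:
        subst, counter = {}, [0]
    # Identical terms generalize to themselves.
    if t1 == t2:
        return t1
    # Applications with the same functor and arity: recurse on arguments.
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        return (t1[0],) + tuple(
            anti_unify(a, b, subst, counter) for a, b in zip(t1[1:], t2[1:]))
    # Otherwise introduce a fresh variable, reusing it for repeated
    # mismatch pairs so that lgg(p(a, a), p(b, b)) = p(X0, X0).
    key = (t1, t2)
    if key not in subst:
        subst[key] = f"X{counter[0]}"
        counter[0] += 1
    return subst[key]

# lgg of p(f(a), a) and p(f(b), b) is p(f(X0), X0)
print(anti_unify(("p", ("f", "a"), "a"), ("p", ("f", "b"), "b")))
# ('p', ('f', 'X0'), 'X0')
```

Syntactic generalization, as proposed in the paper, plays the same role for parse trees that this operation plays for logic terms.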

Learning based on syntactic parse tree generalization is different from kernel methods, which are nonparametric density estimation techniques that compute a kernel function between data instances (which can include keywords, as well as their syntactic parameters), where a kernel function can be considered a similarity measure. Given a set of labeled instances, kernel methods determine the label of a novel instance by comparing it to the labeled training instances using the kernel function. Nearest neighbor classification and support-vector machines (SVMs) are two popular examples of kernel methods [62], [63]. Compared to kernel methods, syntactic generalization can be considered structure-based and deterministic; linguistic features retain their structures and are not represented as values. An analogue of the edit-distance class of similarity methods within anti-unification will be discussed in Section 6.1; such methods are better suited to objects with richer structures, such as syntactic parse trees.

The main question considered in this study is whether these semantic patterns, unobservable at the level of keyword statistics, can be inferred from a complete parse tree structure. Moreover, the argumentative structures of the way authors communicate their conclusions (as expressed by their syntactic structures) are important in relating a sentence to the above classes. Studies [13], [14] have demonstrated that graph-based machine learning can predict the plausibility of complaint scenarios based on their argumentation structures. Furthermore, we observed that learning the communicative structure of inter-human conflict scenarios can successfully classify the scenarios into a series of domains, from complaints to security-related domains. These findings convince us that applying a similar graph-based machine learning technique to structures such as syntactic trees, which have even weaker links to high-level semantic properties than these settings, can deliver satisfactory classification results. Graph-based learning has been applied in a number of domains beyond linguistics (see, e.g., [60]).

Most of the current learning research on NLP employs particular statistical techniques inspired by research on speech recognition, such as hidden Markov models (HMMs) and probabilistic context-free grammars (PCFGs). A variety of learning methods, including decision tree and rule induction, neural networks, instance-based methods, Bayesian network learning, inductive logic programming, explanation-based learning, and genetic algorithms can also be applied to natural-language problems and can present significant advantages in particular applications [25], [46]. In addition to specific learning algorithms, a variety of general ideas from traditional machine learning, such as active learning, boosting, reinforcement learning, constructive induction, learning with background knowledge, theory refinement, experimental evaluation methods, and PAC learnability, may also be usefully applied to natural-language problems [10]. In this study, we employ the nearest neighbor type of learning, which is relatively simple, to focus our investigation on how expressive the similarity between syntactic structures can be in the detection of weak semantic signals. Other, more complex learning techniques can be applied, being more sensitive or more cautious, after we confirm that our measure of the syntactic similarity between texts is adequate.
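The nearest-neighbor step described above can be sketched in a few lines: given any similarity function between two texts, a new sentence receives the class of its most similar labeled example. The word-overlap score below is only a placeholder standing in for parse-tree generalization, and the training sentences and class labels are illustrative.

```python
def overlap_score(a, b):
    """Placeholder similarity: number of shared words between two texts."""
    return len(set(a.split()) & set(b.split()))

def nearest_neighbor_class(sample, labeled, score=overlap_score):
    """Assign `sample` the class of the most similar labeled text.

    `labeled` is a list of (text, class) pairs."""
    return max(labeled, key=lambda pair: score(sample, pair[0]))[1]

training = [
    ("the camera takes sharp pictures", "informative"),
    ("I do not know what to say", "uninformative"),
]
print(nearest_neighbor_class("this camera takes good pictures", training))
# informative
```

Any richer similarity measure, such as the score of the maximum common sub-trees of two parse trees, can be dropped in via the `score` parameter without changing the classifier itself, which is why this simple learner is a good probe for the expressiveness of the similarity measure.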

The computational linguistics community has assembled large data sets for a range of interesting NLP problems. Some of these problems can be reduced to a standard classification task by appropriately constructing their features; however, others require using and/or producing complex data structures, such as complete parse trees and operations on these complete parse trees. In this paper, we introduce the generalization operation on a pair of parse trees for two sentences and demonstrate its role in sentence classification. The operation of generalization is defined starting at the level of lemmas and continuing through chunks/phrases all the way up to paragraphs/texts.
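At the lowest of these levels, generalization can be illustrated as a toy operation on two chunks represented as (lemma, POS) sequences: tokens that agree in part of speech keep the POS tag, and the lemma survives only if it is shared, otherwise it is replaced by a wildcard. The phrases, tags and the `*` wildcard here are our illustrative conventions, not the paper's exact data structures.

```python
def generalize_chunks(chunk1, chunk2):
    """Generalize two chunks of the same syntactic role, token by token."""
    result = []
    for (lemma1, pos1), (lemma2, pos2) in zip(chunk1, chunk2):
        if pos1 != pos2:          # POS mismatch: no common structure here
            continue
        # Same POS: keep the lemma if it matches, otherwise keep POS only.
        result.append((lemma1 if lemma1 == lemma2 else "*", pos1))
    return result

vp1 = [("buy", "VB"), ("digital", "JJ"), ("camera", "NN")]
vp2 = [("buy", "VB"), ("cheap", "JJ"), ("camera", "NN")]
print(generalize_chunks(vp1, vp2))
# [('buy', 'VB'), ('*', 'JJ'), ('camera', 'NN')]
```

The paper's actual operation works on aligned sub-trees rather than flat token sequences, but the intuition is the same: the result retains exactly the information common to both inputs.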

Learning syntactic parse trees allows one to conduct semantic inference in a domain-independent manner without using ontologies or other manually built resources. Training sets for text classification problems still need to be collected, but class assignment can be automated. At the same time, in contrast to most semantic inference projects, we restrict ourselves to a very specific semantic domain (a limited set of classes), solving a number of practical problems.

The paper is organized as follows. We introduce three distinct problems of different complexities in which one or another semantic feature must be inferred from natural language sentences. We then describe the algorithm of the generalization of parse trees, followed by the nearest neighbor learning of the generalization results. The paper concludes with a comparative analysis of classification in selected problem domains, a search engine description, and a brief review of other studies with semantic inferences.

Section snippets

Application areas of syntactic generalization

In this study, we leverage the parse tree generalization technique in the automation of content management and a delivery platform [15], [57] referred to as the Integrated Opinion Delivery Environment. This platform combines data mining of the web and social networks, content aggregation, reasoning, information extraction, question answering and advertising to support distributed recommendation forums for a wide variety of products and services. In addition to human users, automated agents

Generalizing portions of text

To measure the similarity of abstract entities expressed by logic formulas, a least-general generalization was proposed for a number of machine learning approaches, including explanation-based learning and inductive logic programming. Least-general generalization was originally introduced by Plotkin [26]. It is the opposite of most-general unification [27]; therefore, it is also known as anti-unification. Anti-unification was first studied in Plotkin and Robinson [26], [27]. As its name

From generalization to logical form representation

We now demonstrate how the generalization framework can be combined with semantic representations, such as logic forms, to perform the learning of a text's meaning. We have demonstrated how semantic features can be deduced from syntactic parse trees when an appropriate similarity operation is found. However, in a number of applications, certain semantic knowledge is available in advance and therefore, does not have to be learned. In this section, we show how to combine preset semantic

Syntactic generalization-based search engine and its evaluation

The search engine based on syntactic generalization is designed to provide opinion data in an aggregated form obtained from various sources. Conventional search results and Google-sponsored link formats are selected because they are the most effective and are already accepted by a vast community of users.

Comparative performance analysis in text classification domains

To evaluate the expressiveness and sensitivity of the syntactic generalization operation and its associated scoring system, we applied the Nearest Neighbor algorithm to the series of text classification tasks outlined in Section 2 (Table 3). We formed several datasets for each problem, conducted an independent evaluation on each dataset, and averaged the resultant accuracies (F-measure). The training and evaluation datasets of the texts and the class assignments were made by the authors. Half of
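For reference, the F-measure used as the accuracy metric above is the harmonic mean of precision and recall; a minimal computation from raw counts (the counts below are illustrative, not the paper's results):

```python
def f_measure(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(tp=80, fp=20, fn=10), 3))
# 0.842
```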

Related work

Most of the work on automated semantic inference from syntax deals with much lower semantic levels than the semantic classes we manage in this study. de Salvo Braz et al. [21] present a principled, integrated approach to semantic entailment. These authors developed an expressive knowledge representation that provides a hierarchical encoding of the structural, relational and semantic properties of the text and populated it using a variety of machine learning-based tools. An inferential mechanism

Conclusions

In this study, we demonstrated that high-level semantic features of sentences, such as informativeness, can be learned from the low-level linguistic data of a complete parse tree. Unlike the traditional approaches to the multilevel derivation of semantics from syntax, we explored the possibility of linking low-level but detailed syntactic levels with high-level pragmatic and semantic levels directly.

In recent decades, most approaches to NL semantics relied on mapping to First Order Logic

Acknowledgments

We are grateful to our colleagues SO Kuznetsov, B Kovalerchuk and others for valuable discussions and to our anonymous reviewers for their suggestions. This research is partially funded by the EU Project No. 238887, a unique European Citizens' attention service (iSAC6+) IST-PSP. This research is also funded by the Spanish MICINN (Ministerio de Ciencia e Innovación) IPT-430000-2010-13 project Social powered Agents for Knowledge search Engine (SAKE), TIN2010-17903 Comparative approaches to the

Boris Galitsky has been contributing natural language-related technologies to Silicon Valley start-ups over the last two decades. In 1999 he co-founded iAskWeb, which provided tax and investment recommendations to customers of a few Fortune 500 companies. He contributed his linguistic technology to Xoopit, acquired by Yahoo; Uptake, acquired by Groupon; and LogLogic, acquired by Tibco. He received his PhD in natural language understanding in 1994 and an ANECA/EU Associate Professorship degree in 2011. Boris has authored more than 70 publications, a book and multiple patents in the field of natural language understanding. He is currently a lead scientist at eBay.

References (68)

  • R. Bar-Haim et al.

    Semantic inference at the lexical-syntactic level

    AAAI-05

    (2005)
  • P. Bulychev et al.

    Duplicate code detection using anti-unification

  • M. Banko et al.

    Open information extraction from the web

  • U. Baldewein et al.

    Semantic role labeling with chunk sequences

    (2004)
  • M. Dzikovska et al.

    Generic parsing for multi-domain semantic interpretation

  • K. Hacioglu et al.

    Semantic role labeling by tagging syntactic chunks

  • C. Cardie et al.

    Machine learning and natural language

    Machine Learning

    (1999)
  • X. Carreras et al.

    Introduction to the CoNLL-2004 shared task: Semantic role labeling

  • B. Galitsky

    Natural language question answering system: technique of semantic headers

    (2003)
  • B. Galitsky et al.

    Learning communicative actions of conflicting human agents

    Journal of Experimental & Theoretical Artificial Intelligence

    (2008)
  • B. Galitsky et al.

    Using generalization of syntactic parse trees for taxonomy capture on the web

    ICCS

    (2011)
  • B. Galitsky et al.

    Increasing the relevance of meta-search using parse trees

  • B. Galitsky et al.

    Semantic classification based on machine learning of parse trees

  • M. Tatu et al.

    A logic-based semantic approach to recognizing textual entailment

  • V. Punyakanok et al.

    The necessity of syntactic parsing for semantic role labeling

    IJCAI-05

    (2005)
  • R. de Salvo Braz et al.

    An inference model for semantic entailment in natural language

  • D. Lin et al.

    DIRT: discovery of inference rules from text

  • J.S. Mill

    A system of logic, ratiocinative and inductive

    (1843)
  • D. Moldovan et al.

    COGEX: a logic prover for question answering

  • G.D. Plotkin

    A note on inductive generalization

  • J.A. Robinson

    A machine-oriented logic based on the resolution principle

    Journal of the Association for Computing Machinery

    (1965)
  • L. Romano et al.

    Investigating a generic paraphrase-based approach for relation extraction

  • J.C. Reynolds

    Transformational systems and the algebraic structure of atomic formulas

    Machine Intelligence

    (1970)
  • D. Ravichandran et al.

    Learning surface text patterns for a Question Answering system



Prof. Josep Lluís de la Rosa, [email protected], h-index = 15, received his MSc and PhD in Computer Engineering from the Autonomous University of Barcelona (UAB) in 1989 and 1993, and an MBA in 2002. He is a professor at the Universitat de Girona (UdG) and director of the ARLab (Agents Research Laboratory — GRCT69). He has published 100+ papers in international journals and 300+ papers in international conferences, holds 4 patents and has founded 3 spin-off companies. He was a visiting professor at Rensselaer Polytechnic Institute (RPI) in 2008–2010. His research interests focus on intelligent agents, understanding the agency property of introspection or self-awareness, as well as understanding its impact on the emergent behaviour of billions of agents by means of computational ecology models. Digital preservation, social networks and complementary currencies are the areas of application. He has participated in several successful EU projects, such as ONE, Open Negotiation Environments, FP6-2005-IST-5, grant agreement no. 34744 (2006–2009), and PROTAGE, Preservation of Digital Information with Intelligent Agents (2007–2011).

Gábor Dobrocsi is an informatics engineer who earned his MSc at the University of Miskolc (Hungary). After receiving his degree, he became a visiting scientist at Rensselaer Polytechnic Institute (USA), where he participated in academic research to develop an alternative review system for scientific publications. He then joined the development team of a high-end commercial citizens' information and assessment service system featuring natural language processing techniques at EASY Innova (Spain). He is currently a PhD student at the University of Girona (Spain), researching in the fields of agent technologies, social search and recommendation systems, and natural language processing.
