Graph-based Arabic text semantic representation

https://doi.org/10.1016/j.ipm.2019.102183

Highlights

  • Semantic representation of Arabic text can facilitate several natural language processing applications such as text summarization and textual entailment.

  • A graph-based Arabic text semantic representation model is used to represent the meaning of Arabic sentences as a rooted acyclic graph.

  • The proposed representation model was evaluated according to its ability to enhance Arabic Textual Entailment recognition.

  • The Arabic Textual Entailment Dataset (ArbTED) was used in the experiments, and the results show that the proposed model enhances the performance of Arabic textual entailment recognition.

Abstract

Semantic representation reflects the meaning of a text as it may be understood by humans. Thus, it contributes to facilitating various automated language processing applications. Although semantic representation is very useful for several applications, only a few models have been proposed for the Arabic language. In that context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts are employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined Arabic linguistic rules. The core idea of the proposed model is to represent the meaning of Arabic sentences as a rooted acyclic graph. The textual entailment recognition challenge is considered in order to evaluate the ability of the proposed model to enhance other Arabic NLP applications. The experiments were conducted using a benchmark Arabic textual entailment dataset, namely ArbTED. The results show that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to baseline models. On average, the proposed model achieves improvements of 8.6%, 30.2%, 5.3%, and 16.2% in accuracy, recall, precision, and F-score, respectively.

Introduction

Semantics refers to the systematic representation of knowledge in a sufficiently precise notation that can be used by computer programs (Hayes, 1974). The semantic relations between text components help in better understanding human language and in building more accurate automated cognitive systems. In linguistics, semantics refers to the study of the relations between text components (words, statements, etc.) and their implicit signification (Abend & Rappoport, 2017), while semantic representation reflects the meaning of the text as it is understood by humans. Several applications in computational linguistics (e.g., machine translation and question answering) utilize semantic representation to obtain better results. The core idea of semantic representation is to develop specific and precise notations of the text that reflect its meaning.

The common techniques used to represent knowledge and semantics can be classified into four main groups: predicate logic representation, network representation, frame representation, and rule-based representation. In predicate logic representation, sentences are split into words, and the semantic relations between words are expressed using predicate logic notation. For instance, the statement Time is running is represented as: running(time). Predicate logic has been used to represent the semantic level of analysis for several languages, such as English (Ali & Khan, 2009) and Urdu (Ali & Khan, 2010). The representation and retrieval complexity for complex sentences and the exclusion of supporting words (e.g., is) are the main drawbacks of this notation. In addition, predicate logic-based methods face difficulties when trying to represent ambiguous words that have different meanings (Ali & Khan, 2009). Network representation (i.e., a semantic network or semantic graph) was proposed by Quillian (1968). It describes the text as a directed labeled graph in terms of vertices and edges. The size of the semantic network grows with the amount of original text, and its complexity increases the time needed for manipulation. Nevertheless, semantic networks are powerful and flexible knowledge representation techniques that can be used to model the semantic relations between text components. Frame representation is a data structure, proposed by Minsky in 1974, that represents sentences as slots of objects that carry information (Mylopoulos, 1980). Splitting the original text into small slots and extracting their values are time-consuming processes that make frame representation inefficient. Furthermore, rebuilding the original sentence from its frame representation is very difficult (Ali & Khan, 2009). Finally, in rule-based representation, a sentence is represented as a set of if-then rules. In rule-based systems, once a set of rules is satisfied, the system provides a solution without applying the remaining rules; the solution may therefore differ from the one obtained when other rules are applied. This allows rule-based representation to produce multiple representations of the same sentence, which makes retrieving the original text from its rule-based representation a difficult task (Tayal, Raghuwanshi, & Malik, 2015).
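
To make the contrast between the first two families concrete, the following minimal Python sketch (our own illustration, not taken from the paper) encodes the example statement Time is running first in predicate logic style and then as a tiny semantic network; the relation label agent-of is an assumed name chosen only for demonstration.

```python
# Illustration only: the sentence "Time is running" in two representation styles.

# 1) Predicate logic style: the supporting word "is" is dropped and the
#    sentence is reduced to predicate(argument), i.e. running(time).
predicate_logic = ("running", ("time",))

# 2) Semantic network style: a directed labeled graph of vertices and edges.
#    Vertices are the content words; the edge label names the relation
#    ("agent-of" is an assumed label, used only for this example).
semantic_network = {
    "vertices": {"time", "running"},
    "edges": [("time", "agent-of", "running")],  # (source, label, target)
}

print(predicate_logic)              # ('running', ('time',))
print(semantic_network["edges"])    # [('time', 'agent-of', 'running')]
```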

The task of mapping natural language text into its semantic representation is called semantic parsing. The mapping process parses the text into its semantic representation without syntactic classification of the text's components (Wilks & Fass, 1992). Semantic parsers have attracted a great deal of attention in the field of Natural Language Processing (NLP) over the last few decades (Liang, 2016) and have been used to perform several NLP tasks such as question answering and machine translation. Semantic parsers can be categorized into two main types: shallow semantic parsers and deep semantic parsers. A shallow semantic parser labels each word in the original sentence according to its semantic role (Jurafsky & Martin, 2009). A deep semantic parser represents each composite component of the text depending on its meaning in the sentence (Liang & Potts, 2015).

This paper is organized as follows: In Section 2, the main objectives and goals of this research are described. In Section 3, we briefly review related work on knowledge representation and semantic representation for Arabic text. Section 4 describes the main features of the Arabic language that affect the semantic representation model. Section 5 describes the proposed model. Section 6 presents the process of building the proposed semantic graph. The experimental results are discussed in Section 7. Finally, the conclusion is presented in Section 8.

Section snippets

Research objective

In general, semantic analysis relies on well-built resources and machine learning techniques. However, little work on semantic analysis has been dedicated to the Arabic language, and the proposed semantic methods and applications do not achieve good results. This is due to the structural and morphological complexity of the Arabic language and the lack of Arabic semantic resources. In general, most of the developed Arabic language parsers focus on the structure of the Arabic language in terms of

Related work

Several models and projects have been proposed for semantic representation and parsing of natural language text, such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013), the Groningen Meaning Bank (GMB) (Bos, Basile, Evang, Venhuizen, & Bjerva, 2017), Universal Conceptual Cognitive Annotation (UCCA) (Abend & Rappoport, 2013), and Universal Networking Language (UNL) (Boguslavsky et al., 2000). These approaches differ in terms of representation type, structure (concepts and

Arabic language features

The Arabic language has a sophisticated structure in terms of grammar, syntax, and morphology. Furthermore, it has many features that make its semantic parsing a challenging task. Arabic language features can be grouped into two main types: morphological-level features and sentence-level features. The morphological features of Arabic words have an impact on the analysis and processing of Arabic text. These features include agreement and word formation. The agreement feature refers to

The proposed model

A graph is defined as G = (V, E), where V is a set of vertices and E is a set of edges, with E ⊆ V × V. A graph is called a weighted graph if there is a weight function W that assigns a value to each edge in the graph; this value is application- or domain-dependent and can represent a cost, a distance, or any other descriptive value. Otherwise, the graph is called an unweighted graph. According to the type of edges, a graph is classified into two main types: a directed graph and an undirected graph. In the directed
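
As a minimal sketch of this definition (the library choice and vertex names are ours, not the paper's), the following Python snippet builds a small rooted, directed, labeled graph with networkx and checks that it is acyclic, the property required of the proposed semantic graph.

```python
# Minimal sketch, assuming networkx is available; vertex and relation names
# are placeholders, not taken from the paper.
import networkx as nx

G = nx.DiGraph()                                       # directed: each edge is an ordered pair in V x V
G.add_edge("root", "concept_A", relation="subject")    # labeled edge standing in for a semantic relation
G.add_edge("root", "concept_B", relation="object")

# A rooted acyclic graph: no directed cycles, and a single vertex with no incoming edges.
assert nx.is_directed_acyclic_graph(G)
roots = [v for v in G.nodes if G.in_degree(v) == 0]
print(roots)                                           # ['root']
```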

Building the semantic graph

Many Arabic text processing toolkits have been proposed in order to perform specific text processing tasks, such as POS tagging, segmentation, dependency parsing, and named entity recognition. Farasa is one of the latest Arabic text processing toolkits, proposed by the Arabic Language Technologies Group at the Qatar Computing Research Institute (QCRI). It is an open-source text processing toolkit that provides many text processing capabilities such as segmentation,
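
The snippet below is a hypothetical sketch of how such toolkit output can be turned into a graph: the exact Farasa API calls are omitted (they depend on the toolkit version), so the tokens, POS tags, and dependency arcs are hard-coded in an assumed format, and networkx is again used only for illustration.

```python
# Hypothetical sketch: assume a toolkit such as Farasa has already produced
# tokens, POS tags, and dependency arcs for one sentence; the exact API calls
# are not shown here. The data format below is an assumption for illustration.
import networkx as nx

tokens = ["ذهب", "الولد", "إلى", "المدرسة"]            # "The boy went to the school"
pos_tags = ["VERB", "NOUN", "PREP", "NOUN"]
dependencies = [(0, 1, "subj"), (0, 3, "obl"), (3, 2, "case")]  # (head, dependent, label)

# Build a directed graph: one vertex per token, one labeled edge per dependency arc.
G = nx.DiGraph()
for i, (word, tag) in enumerate(zip(tokens, pos_tags)):
    G.add_node(i, word=word, pos=tag)
for head, dep, label in dependencies:
    G.add_edge(head, dep, relation=label)

for u, v, data in G.edges(data=True):
    print(tokens[u], "-", data["relation"], "->", tokens[v])
```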

Evaluation

The semantic graph could be evaluated according to its ability to enhance other NLP applications, such as Question Answering (QA), keyword extraction and Textual Entailment (TE) recognition. In this research, TE recognition will be used to evaluate the ability of the proposed semantic graph representation to enhance other Arabic NLP applications.
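
As an illustration of how a semantic graph can feed TE recognition (this is a simplified heuristic of our own, not the paper's actual decision rule), the sketch below judges entailment by how much of the hypothesis graph's edges are covered by the text graph; the threshold value and edge labels are assumed parameters.

```python
# Simplified illustration, not the paper's method: entailment as edge coverage
# of the hypothesis graph by the text graph.
def entails(text_edges, hypothesis_edges, threshold=0.8):
    """Each edge is a (source, relation, target) triple; threshold is an assumed parameter."""
    hypothesis = set(hypothesis_edges)
    if not hypothesis:
        return True                         # an empty hypothesis is trivially entailed
    coverage = len(set(text_edges) & hypothesis) / len(hypothesis)
    return coverage >= threshold

text = [("boy", "agent-of", "go"), ("go", "to", "school")]
hypothesis = [("boy", "agent-of", "go")]
print(entails(text, hypothesis))            # True: the hypothesis edge appears in the text graph
```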

Conclusion

In this article, we proposed a graph-based semantic representation model of Arabic texts. The proposed model represents words and the semantic relations between them as a rooted acyclic graph called a semantic graph. The vertices of the proposed semantic graph consist of the original words in addition to the main concepts, while the edges represent the semantic relations between words. Arabic language features are considered during semantic graph construction. The proposed representation model

CRediT authorship contribution statement

Wael Etaiwi: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Visualization. Arafat Awajan: Supervision, Validation, Writing - review & editing, Project administration.

Declaration of Competing Interest

None.

References (62)

  • O. Abend et al.

    Universal conceptual cognitive annotation (UCCA)

    Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers)

    (2013)
  • O. Abend et al.

    The state of the art in semantic representation

    Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers)

    (2017)
  • F.T. AL-Khawaldeh

    A study of the effect of resolving negation and sentiment analysis in recognizing text entailment for Arabic

    World of Computer Science & Information Technology Journal

    (2015)
  • A.T. Al-Taani et al.

    An extractive graph-based Arabic text summarization approach

    The international Arab conference on information technology, Jordan

    (2014)
  • M. Alabbas et al.

    Natural language inference for Arabic using extended tree edit distance with subtrees

    Journal of Artificial Intelligence Research

    (2013)
  • M. Alabbas et al.

    Optimising tree edit distance with subtrees for textual entailment

    Proceedings of the international conference recent advances in natural language processing RANLP 2013

    (2013)
  • N. Alami et al.

    Arabic text summarization based on graph theory

    2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA)

    (2015)
  • S. Alansary et al.

    The Universal Networking Language in action in English-Arabic machine translation

    Proceedings of the 9th Egyptian society of language engineering conference on language engineering (ESOLEC 2009)

    (2009)
  • A. Ali et al.

    Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes

    2009 international conference on emerging technologies

    (2009)
  • A. Ali et al.

    Knowledge representation of Urdu text using predicate logic

    2010 6th international conference on emerging technologies (ICET)

    (2010)
  • I. Androutsopoulos et al.

    A survey of paraphrasing and textual entailment methods

    Journal of Artificial Intelligence Research

    (2010)
  • H.F. de Arruda et al.

    Paragraph-based representation of texts: a complex networks approach

    Information Processing & Management

    (2019)
  • A. Awajan

    Arabic text preprocessing for the natural language processing applications

    Arab Gulf Journal of Scientific Research

    (2007)
  • A. Awajan

    Unsupervised approach for automatic keyword extraction from Arabic documents

    Proceedings of the 26th conference on computational linguistics and speech processing (ROCLING 2014)

    (2014)
  • A. Awajan

    Keyword extraction from Arabic documents using term equivalence classes

    ACM Transactions on Asian and Low-Resource Language Information Processing

    (2015)
  • L. Banarescu et al.

    Abstract meaning representation for sembanking

    Proceedings of the 7th linguistic annotation workshop and interoperability with discourse

    (2013)
  • W. Black et al.

    Introducing the Arabic WordNet project

    Proceedings of the third international wordnet conference

    (2006)
  • I. Boguslavsky et al.

    Creating a Universal Networking Language module within an advanced NLP system

    Proceedings of the 18th conference on computational linguistics-volume 1

    (2000)
  • J. Bos et al.

    The Groningen Meaning Bank

    (2017)

  • I. Bounhas et al.

    Building a morpho-semantic knowledge graph for Arabic information retrieval

    Information Processing & Management

    (2019)
  • I. Dagan et al.

    The PASCAL recognising textual entailment challenge

    Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising tectual entailment

    (2006)