Graph-based Arabic text semantic representation

https://doi.org/10.1016/j.ipm.2019.102183

Highlights

  • Semantic representation of Arabic text can facilitate several natural language processing applications such as text summarization and textual entailment.

  • A graph-based Arabic text semantic representation model is used to represent the meaning of Arabic sentences as a rooted acyclic graph.

  • The proposed representation model was evaluated according to its ability to enhance Arabic Textual Entailment recognition.

  • The Arabic Textual Entailment Dataset (ArbTED) was used in the experiments, and the results show that the proposed model enhances the performance of Arabic textual entailment recognition.

Abstract

Semantic representation reflects the meaning of a text as it may be understood by humans. Thus, it contributes to facilitating various automated language processing applications. Although semantic representation is very useful for several applications, only a few models have been proposed for the Arabic language. In that context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts are employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined Arabic linguistic rules. The core idea of the proposed model is to represent the meaning of Arabic sentences as a rooted acyclic graph. The textual entailment recognition challenge is considered in order to evaluate the ability of the proposed model to enhance other Arabic NLP applications. The experiments were conducted using a benchmark Arabic textual entailment dataset, namely ArbTED. The results show that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to baseline models. On average, the proposed model achieves improvements of 8.6%, 30.2%, 5.3%, and 16.2% in accuracy, recall, precision, and F-score, respectively.

Introduction

Semantics refers to the systematic representation of knowledge in a sufficiently precise notation that can be used by computer programs (Hayes, 1974). The semantic relations between text components help in better understanding human language and in building more accurate automated cognitive systems. In linguistics, semantics refers to the study of the relations between text components (words, statements, etc.) and their implicit signification (Abend & Rappoport, 2017), while semantic representation reflects the meaning of the text as it is understood by humans. Several applications in computational linguistics (e.g., machine translation and question answering) utilize semantic representation to obtain better results. The core idea of semantic representation is to develop specific and precise notations of the text that reflect its meaning.

The common techniques used to represent knowledge and semantics can be classified into four main groups: predicate logic representation, network representation, frame representation, and rule-based representation. In predicate logic representation, sentences are split into words, and the semantic relations between words are expressed using predicate logic notation. For instance, the statement Time is running is represented as: running(time). Predicate logic has been used to represent the semantic level of analysis for several languages, such as English (Ali & Khan, 2009) and Urdu (Ali & Khan, 2010). The representation and retrieval complexity for complex sentences and the exclusion of supporting words (e.g., is) are the main drawbacks of this notation. In addition, predicate logic-based methods face difficulties when trying to represent ambiguous words that have different meanings (Ali & Khan, 2009). Network representation (i.e., a semantic network or semantic graph) was proposed by Quillian (1968). It describes the text as a directed labeled graph in terms of vertices and edges. The size of the semantic network grows with the amount of original text, and its complexity increases the time needed for manipulation. Nevertheless, semantic networks are powerful and flexible knowledge representation techniques that can be used to model the semantic relations between text components. Frame representation is a data structure, proposed by Minsky in 1974, that represents sentences as slots of objects that carry information (Mylopoulos, 1980). Splitting the original text into small slots and extracting their values are time-consuming processes that make frame representation inefficient. Furthermore, rebuilding the original sentence from its frame representation is very difficult (Ali & Khan, 2009). Finally, in rule-based representation, a sentence is represented as a set of if-then rules. In rule-based systems, once a set of rules is satisfied, the system provides a solution without applying the remaining rules; the solution may therefore differ from the one obtained when other rules are applied. This allows rule-based representation to produce multiple representations of the same sentence, which makes retrieving the original text from its rule-based representation a difficult task (Tayal, Raghuwanshi, & Malik, 2015).
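
To make the contrast between the first two families concrete, the following minimal Python sketch (our own illustration, not taken from the paper) encodes the example statement Time is running first in predicate logic style and then as a tiny semantic network; the relation label agent-of is an assumed name chosen only for demonstration.

```python
# Illustration only: the sentence "Time is running" in two representation styles.

# 1) Predicate logic style: the supporting word "is" is dropped and the
#    sentence is reduced to predicate(argument), i.e. running(time).
predicate_logic = ("running", ("time",))

# 2) Semantic network style: a directed labeled graph of vertices and edges.
#    Vertices are the content words; the edge label names the relation
#    ("agent-of" is an assumed label, used only for this example).
semantic_network = {
    "vertices": {"time", "running"},
    "edges": [("time", "agent-of", "running")],  # (source, label, target)
}

print(predicate_logic)              # ('running', ('time',))
print(semantic_network["edges"])    # [('time', 'agent-of', 'running')]
```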

The task of mapping natural language text into its semantic representation is called semantic parsing. The mapping process parses the text into its semantic representation without syntactic classification of the text's components (Wilks & Fass, 1992). Semantic parsers have attracted a great deal of attention in the field of Natural Language Processing (NLP) over the last few decades (Liang, 2016) and have been used to perform several NLP tasks such as question answering and machine translation. Semantic parsers can be categorized into two main types: shallow semantic parsers and deep semantic parsers. A shallow semantic parser labels each word in the original sentence according to its semantic role (Jurafsky & Martin, 2009). A deep semantic parser represents each composite component of the text depending on its meaning in the sentence (Liang & Potts, 2015).

This paper is organized as follows: In Section 2, the main objectives and goals of this research are described. In Section 3, we briefly review related work on knowledge representation and semantic representation for Arabic text. Section 4 describes the main features of the Arabic language that affect the semantic representation model. Section 5 describes the proposed model. Section 6 presents the process of building the proposed semantic graph. The experimental results are discussed in Section 7. Finally, the conclusion is presented in Section 8.

Section snippets

Research objective

In general, semantic analysis relies on well-built resources and machine learning techniques. However, little work on semantic analysis has been dedicated to the Arabic language, and the proposed semantic methods and applications do not achieve good results. This is due to the structural and morphological complexity of the Arabic language and the lack of Arabic semantic resources. In general, most of the developed Arabic language parsers focus on the structure of the Arabic language in terms of

Related work

Several models and projects have been proposed for semantic representation and parsing of natural language text, such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013), the Groningen Meaning Bank (GMB) (Bos, Basile, Evang, Venhuizen, & Bjerva, 2017), Universal Conceptual Cognitive Annotation (UCCA) (Abend & Rappoport, 2013), and Universal Networking Language (UNL) (Boguslavsky et al., 2000). These approaches differ in terms of representation type, structure (concepts and

Arabic language features

The Arabic language has a sophisticated structure in terms of grammar, syntax, and morphology. Furthermore, it has many features that make its semantic parsing a challenging task. Arabic language features can be grouped into two main types: morphological-level features and sentence-level features. The morphological features of Arabic words have an impact on the analysis and processing of Arabic text. These features include agreement and word formation. The agreement feature refers to

The proposed model

A graph is defined as G = (V, E), where V is a set of vertices and E is a set of edges, with E ⊆ V × V. A graph is called a weighted graph if there is a weight function W that assigns a value to each edge in the graph; this value is application- or domain-dependent and can represent a cost, a distance, or any other descriptive value. Otherwise, the graph is called an unweighted graph. According to the type of edges, a graph is classified into two main types: a directed graph and an undirected graph. In the directed
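
As a minimal sketch of this definition (the library choice and vertex names are ours, not the paper's), the following Python snippet builds a small rooted, directed, labeled graph with networkx and checks that it is acyclic, the property required of the proposed semantic graph.

```python
# Minimal sketch, assuming networkx is available; vertex and relation names
# are placeholders, not taken from the paper.
import networkx as nx

G = nx.DiGraph()                                       # directed: each edge is an ordered pair in V x V
G.add_edge("root", "concept_A", relation="subject")    # labeled edge standing in for a semantic relation
G.add_edge("root", "concept_B", relation="object")

# A rooted acyclic graph: no directed cycles, and a single vertex with no incoming edges.
assert nx.is_directed_acyclic_graph(G)
roots = [v for v in G.nodes if G.in_degree(v) == 0]
print(roots)                                           # ['root']
```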

Building the semantic graph

Many Arabic text processing toolkits have been proposed in order to perform specific text processing tasks, such as POS tagging, segmentation, dependency parsing, and named entity recognition. Farasa is one of the latest Arabic text processing toolkits, proposed by the Arabic Language Technologies Group at the Qatar Computing Research Institute (QCRI). It is an open-source text processing toolkit that provides many text processing capabilities such as segmentation,
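
The snippet below is a hypothetical sketch of how such toolkit output can be turned into a graph: the exact Farasa API calls are omitted (they depend on the toolkit version), so the tokens, POS tags, and dependency arcs are hard-coded in an assumed format, and networkx is again used only for illustration.

```python
# Hypothetical sketch: assume a toolkit such as Farasa has already produced
# tokens, POS tags, and dependency arcs for one sentence; the exact API calls
# are not shown here. The data format below is an assumption for illustration.
import networkx as nx

tokens = ["ذهب", "الولد", "إلى", "المدرسة"]            # "The boy went to the school"
pos_tags = ["VERB", "NOUN", "PREP", "NOUN"]
dependencies = [(0, 1, "subj"), (0, 3, "obl"), (3, 2, "case")]  # (head, dependent, label)

# Build a directed graph: one vertex per token, one labeled edge per dependency arc.
G = nx.DiGraph()
for i, (word, tag) in enumerate(zip(tokens, pos_tags)):
    G.add_node(i, word=word, pos=tag)
for head, dep, label in dependencies:
    G.add_edge(head, dep, relation=label)

for u, v, data in G.edges(data=True):
    print(tokens[u], "-", data["relation"], "->", tokens[v])
```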

Evaluation

The semantic graph could be evaluated according to its ability to enhance other NLP applications, such as Question Answering (QA), keyword extraction and Textual Entailment (TE) recognition. In this research, TE recognition will be used to evaluate the ability of the proposed semantic graph representation to enhance other Arabic NLP applications.
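
As an illustration of how a semantic graph can feed TE recognition (this is a simplified heuristic of our own, not the paper's actual decision rule), the sketch below judges entailment by how much of the hypothesis graph's edges are covered by the text graph; the threshold value and edge labels are assumed parameters.

```python
# Simplified illustration, not the paper's method: entailment as edge coverage
# of the hypothesis graph by the text graph.
def entails(text_edges, hypothesis_edges, threshold=0.8):
    """Each edge is a (source, relation, target) triple; threshold is an assumed parameter."""
    hypothesis = set(hypothesis_edges)
    if not hypothesis:
        return True                         # an empty hypothesis is trivially entailed
    coverage = len(set(text_edges) & hypothesis) / len(hypothesis)
    return coverage >= threshold

text = [("boy", "agent-of", "go"), ("go", "to", "school")]
hypothesis = [("boy", "agent-of", "go")]
print(entails(text, hypothesis))            # True: the hypothesis edge appears in the text graph
```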

Conclusion

In this article, we proposed a graph-based semantic representation model of Arabic texts. The proposed model represents words and the semantic relations between them as a rooted acyclic graph called a semantic graph. The vertices of the proposed semantic graph consist of the original words in addition to the main concepts, while the edges represent the semantic relations between words. Arabic language features are considered during semantic graph construction. The proposed representation model

CRediT authorship contribution statement

Wael Etaiwi: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Visualization. Arafat Awajan: Supervision, Validation, Writing - review & editing, Project administration.

Declaration of Competing Interest

None.

References (62)

  • O. Abend et al.

    Universal conceptual cognitive annotation (UCCA)

    Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers)

    (2013)
  • O. Abend et al.

    The state of the art in semantic representation

    Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers)

    (2017)
  • F.T. AL-Khawaldeh

    A study of the effect of resolving negation and sentiment analysis in recognizing text entailment for Arabic

    World of Computer Science & Information Technology Journal

    (2015)
  • A.T. Al-Taani et al.

    An extractive graph-based Arabic text summarization approach

    The international Arab conference on information technology, Jordan

    (2014)
  • M. Alabbas et al.

    Natural language inference for Arabic using extended tree edit distance with subtrees

    Journal of Artificial Intelligence Research

    (2013)
  • M. Alabbas et al.

    Optimising tree edit distance with subtrees for textual entailment

    Proceedings of the international conference recent advances in natural language processing RANLP 2013

    (2013)
  • N. Alami et al.

    Arabic text summarization based on graph theory

    2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA)

    (2015)
  • S. Alansary et al.

    The Universal Networking Language in action in English-Arabic machine translation

    Proceedings of the 9th Egyptian society of language engineering conference on language engineering (ESOLEC 2009)

    (2009)
  • A. Ali et al.

    Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes

    2009 international conference on emerging technologies

    (2009)
  • A. Ali et al.

    Knowledge representation of Urdu text using predicate logic

    2010 6th international conference on emerging technologies (ICET)

    (2010)
  • I. Androutsopoulos et al.

    A survey of paraphrasing and textual entailment methods

    Journal of Artificial Intelligence Research

    (2010)
  • H.F. de Arruda et al.

    Paragraph-based representation of texts: a complex networks approach

    Information Processing & Management

    (2019)
  • A. Awajan

    Arabic text preprocessing for the natural language processing applications

    Arab Gulf Journal of Scientific Research

    (2007)
  • A. Awajan

    Unsupervised approach for automatic keyword extraction from Arabic documents

    Proceedings of the 26th conference on computational linguistics and speech processing (ROCLING 2014)

    (2014)
  • A. Awajan

    Keyword extraction from Arabic documents using term equivalence classes

    ACM Transactions on Asian and Low-Resource Language Information Processing

    (2015)
  • L. Banarescu et al.

    Abstract meaning representation for sembanking

    Proceedings of the 7th linguistic annotation workshop and interoperability with discourse

    (2013)
  • W. Black et al.

    Introducing the Arabic WordNet project

    Proceedings of the third international wordnet conference

    (2006)
  • I. Boguslavsky et al.

    Creating a Universal Networking Language module within an advanced NLP system

    Proceedings of the 18th conference on computational linguistics-volume 1

    (2000)
  • J. Bos et al.

    The Groningen Meaning Bank

    (2017)

  • I. Bounhas et al.

    Building a morpho-semantic knowledge graph for Arabic information retrieval

    Information Processing & Management

    (2019)
  • I. Dagan et al.

    The PASCAL recognising textual entailment challenge

    Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising tectual entailment

    (2006)