Graph-based Arabic text semantic representation
Introduction
Semantics refers to the systematic representation of knowledge in a notation precise enough to be used by computer programs (Hayes, 1974). The semantic relations between text components help in understanding human language and in building more accurate automated cognitive systems. In linguistics, semantics is the study of the relations between text components (words, statements, etc.) and their implicit signification (Abend & Rappoport, 2017), while semantic representation reflects the meaning of the text as it is understood by humans. Several applications in computational linguistics (e.g., machine translation and question answering) use semantic representation to obtain better results. The core idea of semantic representation is to develop specific and precise notations of the text that reflect its meaning.
The common techniques used to represent knowledge and semantics can be classified into four main groups: predicate logic representation, network representation, frame representation, and rule-based representation. In predicate logic representation, the sentences are split into words, and the semantic relations between words are defined using predicate logic notation. For instance, the statement "Time is running" is represented as running(time). Predicate logic has been used to represent the semantic level of analysis for many languages, such as English (Ali & Khan, 2009) and Urdu (Ali & Khan, 2010). The main drawbacks of this notation are the complexity of representing and retrieving complex sentences and the exclusion of supporting words (e.g., "is"). In addition, predicate logic-based methods face difficulties when representing ambiguous words that have different meanings (Ali & Khan, 2009). Network representation (i.e., a semantic network or semantic graph) was proposed by Quillian (1968). It describes the text as a directed labeled graph in terms of vertices and edges. The size of a semantic network grows with the amount of original text, and the time needed to manipulate it grows with its complexity. Nevertheless, semantic networks are powerful and flexible knowledge representation techniques that can be used to model the semantic relations between text components. The frame representation is a data structure proposed by Minsky in 1974 that represents sentences as slots of objects that carry information (Mylopoulos, 1980). Splitting the original text into small slots and extracting their values are time-consuming processes that make the frame representation an inefficient knowledge representation. Furthermore, rebuilding the original sentence from its frame representation is very difficult (Ali & Khan, 2009).
Finally, in the rule-based representation, a sentence is represented as a set of if-then rules. In rule-based systems, when a set of rules is satisfied, the system provides a solution without applying the remaining rules. Thus, the solution may differ from the solution provided when applying other rules. This allows rule-based representation to provide multiple representations of the same sentence, which makes the process of retrieving the original text from its rule-based representation a difficult task (Tayal, Raghuwanshi, & Malik, 2015).
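The four representation families described above can be contrasted on the example sentence "Time is running". The following is a minimal illustrative sketch, not an implementation taken from any of the cited works: the relation label `agent`, the frame slot names, and the rule format are all assumptions chosen for illustration.

```python
# Illustrative only: four hand-built encodings of "Time is running".
# The relation label "agent", the slot names, and the rule format are
# hypothetical and are not taken from the cited works.
from typing import NamedTuple

# 1. Predicate logic: a predicate applied to its arguments.
#    The supporting word "is" is dropped, as noted in the text.
predicate = ("running", ("time",))

def show_predicate(p):
    """Render a (name, args) pair in predicate notation."""
    name, args = p
    return f"{name}({', '.join(args)})"

# 2. Semantic network: a directed labeled graph stored as edge triples.
class Edge(NamedTuple):
    source: str
    label: str
    target: str

network = [Edge("running", "agent", "time")]

# 3. Frame: named slots that carry values.
frame = {"action": "running", "agent": "time", "tense": "present"}

# 4. Rule-based: if-then rules; the first rule whose conditions hold
#    fires and the remaining rules are not applied.
rules = [
    ({"action": "running"}, "something is in motion"),
    ({"agent": "time"}, "the sentence is about time"),
]

def first_matching_rule(facts, rule_list):
    """Return the conclusion of the first rule satisfied by the facts."""
    for conditions, conclusion in rule_list:
        if all(facts.get(key) == value for key, value in conditions.items()):
            return conclusion
    return None
```

Calling `first_matching_rule(frame, rules)` and then calling it again with the rules in reverse order yields two different conclusions for the same sentence, which is the order dependence the text identifies as a weakness of rule-based representation.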
The task of mapping natural language text into its semantic representation is called semantic parsing. The mapping process parses the text into its semantic representation without syntactic classification of the text's components (Wilks & Fass, 1992). Semantic parsers have attracted considerable attention in the field of Natural Language Processing (NLP) over the last few decades (Liang, 2016). They have been used to perform several NLP tasks, such as question answering and machine translation. Semantic parsers are categorized into two main types: shallow semantic parsers and deep semantic parsers. The shallow semantic parser labels each word in the original sentence according to its semantic role (Jurafsky & Martin, 2009). The deep semantic parser represents each composite component in the text depending on its meaning in the sentence (Liang & Potts, 2015).
This paper is organized as follows: In Section 2, the main objectives of this research are described. In Section 3, we briefly review the related work on knowledge representation and semantic representation for Arabic text. Section 4 describes the main features of the Arabic language that affect the semantic representation model. Section 5 describes the proposed model. Section 6 presents the process of building the proposed semantic graph. The experimental results are discussed in Section 7. Finally, the conclusion is presented in Section 8.
Section snippets
Research objective
In general, semantic analysis uses well-built resources in machine learning techniques. However, less work has been dedicated to semantic analysis of the Arabic language, and the proposed semantic methods and applications do not achieve good results. This is due to the structural and morphological complexity of the Arabic language and the lack of Arabic semantic resources. Most of the developed Arabic language parsers focus on the structure of the Arabic language in terms of
Related work
Several models and projects have been proposed for semantic representation and parsing of natural language text, such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013), Groningen Meaning Bank (GMB) (Bos, Basile, Evang, Venhuizen, & Bjerva, 2017), Universal Conceptual Cognitive Annotation (UCCA) (Abend & Rappoport, 2013), and Universal Networking Language (UNL) (Boguslavsky et al., 2000). These approaches differ in terms of representation type, structure (concepts and
Arabic language features
The Arabic language has a sophisticated structure in terms of grammar, syntax, and morphology. Furthermore, it has many features that make its semantic parsing a challenging task. Arabic language features can be grouped into two main types: morphological-level features and sentence-level features. The morphological features of Arabic words have an impact on the analysis and processing of Arabic text. These features include the agreement feature and word formation. The agreement feature refers to
The proposed model
A graph is defined as G = (V, E), where V is a set of vertices and E is a set of edges, with E ⊆ V × V. A graph is called a weighted graph if there is a weight function W that assigns a value to each edge in the graph. This value is application- or domain-dependent and can be a cost, a distance, or any descriptive value. Otherwise, the graph is called an unweighted graph. According to the type of edges, a graph is classified into two main types: a directed graph and an undirected graph. In the directed
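As an illustration of the definitions in this section, the sketch below stores a graph as a vertex set V, an edge set E ⊆ V × V, and an optional weight function W kept as a dictionary. The class name `Graph` and its methods are hypothetical and are not part of the proposed model. An undirected edge is assumed here to be stored as two ordered pairs.

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """Hypothetical container for G = (V, E) with an optional weight function W."""
    directed: bool = True
    vertices: set = field(default_factory=set)   # V
    edges: set = field(default_factory=set)      # E, a subset of V x V
    weights: dict = field(default_factory=dict)  # W: edge -> value

    def add_edge(self, u, v, weight=None):
        """Add edge (u, v); an undirected graph also stores (v, u)."""
        self.vertices.update((u, v))
        self.edges.add((u, v))
        if not self.directed:
            self.edges.add((v, u))
        if weight is not None:
            self.weights[(u, v)] = weight
            if not self.directed:
                self.weights[(v, u)] = weight

    def is_weighted(self):
        """True when W assigns a value to every edge of a non-empty graph."""
        return bool(self.edges) and all(e in self.weights for e in self.edges)
```

Keeping W separate from E mirrors the definition: the same edge set can be read as an unweighted graph when no weights are supplied.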
Building the semantic graph
Many text processing toolkits have been proposed for the Arabic language to perform specific text processing tasks, such as POS tagging, segmentation, dependency parsing, and named entity recognition. Farasa is one of the latest Arabic text processing toolkits, developed by the Arabic Language Technologies Group at the Qatar Computing Research Institute (QCRI). It is an open-source text processing toolkit that provides many text processing capabilities such as segmentation,
Evaluation
The semantic graph could be evaluated according to its ability to enhance other NLP applications, such as Question Answering (QA), keyword extraction and Textual Entailment (TE) recognition. In this research, TE recognition will be used to evaluate the ability of the proposed semantic graph representation to enhance other Arabic NLP applications.
Conclusion
In this article, we proposed a graph-based semantic representation model of Arabic texts. The proposed model represents words and the semantic relation between them as a rooted acyclic graph called a semantic graph. The vertices in the proposed semantic graph consist of original words in addition to the main concepts while the edges represent the semantic relations between words. Arabic language features are considered during the semantic graph construction. The proposed representation model
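The rooted acyclic property mentioned above can be checked mechanically. The sketch below is an assumption-laden illustration, not the authors' procedure: it takes "rooted" to mean that every vertex is reachable from a designated root, represents edges as hypothetical (source, relation, target) triples, and uses a depth-first search with three vertex states to detect directed cycles.

```python
from collections import defaultdict

def is_rooted_acyclic(edges, root):
    """Check that every vertex is reachable from root and no directed cycle exists.

    edges: iterable of (source, relation, target) triples (hypothetical format).
    """
    children = defaultdict(list)
    vertices = {root}
    for source, _relation, target in edges:
        children[source].append(target)
        vertices.update((source, target))

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {v: UNSEEN for v in vertices}

    def has_cycle_from(v):
        state[v] = IN_PROGRESS
        for child in children[v]:
            if state[child] == IN_PROGRESS:  # back edge: a directed cycle
                return True
            if state[child] == UNSEEN and has_cycle_from(child):
                return True
        state[v] = DONE
        return False

    if has_cycle_from(root):
        return False
    # Any vertex still unseen is not reachable from the root.
    return all(state[v] == DONE for v in vertices)
```

A vertex may have several parents under this check, so the structure is a rooted directed acyclic graph rather than a tree.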
CRediT authorship contribution statement
Wael Etaiwi: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Visualization. Arafat Awajan: Supervision, Validation, Writing - review & editing, Project administration.
Declaration of Competing Interest
None.
References (62)
et al. Representing meaning of Arabic sentence dynamically and more smoothly. Procedia Computer Science (2018)
et al. Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features. Information Processing & Management (2017)
et al. Graph-based Arabic NLP techniques: A survey. Procedia Computer Science (2018)
Increasing the accuracy of neural network classification using refined training data. Environmental Modelling & Software (2009)
et al. Textual entailment for Arabic language based on lexical and semantic matching. International Journal of Computing and Information Sciences (2016)
Learning executable semantic parsers for natural language understanding. Communications of the ACM (2016)
Semantic networks
et al. The preference semantics family. Computers & Mathematics with Applications (1992)
et al. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing (1989)
et al. Farasa: A fast and furious segmenter for Arabic. Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Demonstrations (2016)
Universal conceptual cognitive annotation (UCCA). Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers)
The state of the art in semantic representation. Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers)
A study of the effect of resolving negation and sentiment analysis in recognizing text entailment for Arabic. World of Computer Science & Information Technology Journal
An extractive graph-based Arabic text summarization approach. The international Arab conference on information technology, Jordan
Natural language inference for Arabic using extended tree edit distance with subtrees. Journal of Artificial Intelligence Research
Optimising tree edit distance with subtrees for textual entailment. Proceedings of the international conference recent advances in natural language processing RANLP 2013
Arabic text summarization based on graph theory. 2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA)
The universal networking language in action in English-Arabic machine translation. Proceedings of 9th Egyptian society of language engineering conference on language engineering (ESOLEC 2009)
Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes. 2009 international conference on emerging technologies
Knowledge representation of Urdu text using predicate logic. 2010 6th international conference on emerging technologies (ICET)
A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research
Paragraph-based representation of texts: a complex networks approach. Information Processing & Management
Arabic text preprocessing for the natural language processing applications. Arab Gulf Journal of Scientific Research
Unsupervised approach for automatic keyword extraction from Arabic documents. Proceedings of the 26th conference on computational linguistics and speech processing (ROCLING 2014)
Keyword extraction from Arabic documents using term equivalence classes. ACM Transactions on Asian Low-Resource Language Information Processing
Abstract meaning representation for sembanking. Proceedings of the 7th linguistic annotation workshop and interoperability with discourse
Introducing the Arabic WordNet project. Proceedings of the third international WordNet conference
Creating a universal networking language module within an advanced NLP system. Proceedings of the 18th conference on computational linguistics-volume 1
The Groningen meaning bank
Building a morpho-semantic knowledge graph for Arabic information retrieval. Information Processing & Management
The PASCAL recognising textual entailment challenge. Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment