1 Introduction

Large-scale data stream analysis has lately become an important business and research priority. Social networks such as Twitter and other micro-blogging platforms hold an enormous amount of data. Extracting valuable information and trends from these data would support better understanding and decision-making. Multiple analysis techniques have been deployed for English content; although Arabic generates a large amount of content on social networks, it remains among the least analyzed languages.

As of March 2014, there were over 5.7 million Arab Twitter users, 2.4 million of them from Saudi Arabia, together producing an average of over 17 million tweets per day. This huge volume of data provides an opportunity for Sentiment Analysis (SA), enabling organizations to observe the feelings and opinions of Twitter users towards products, policies, or people. Existing solutions for Arabic SA are limited compared to English SA approaches; the unique nature and complexity of the Arabic language calls for research into appropriate solutions. Arabic is a morphologically rich language in which important grammatical information is expressed at the word level. Moreover, Arabic is a collection of multiple variants, where the everyday spoken language, Dialectal Arabic (DA), differs from the formal language, Modern Standard Arabic (MSA). On social media, Arab users have started using their own dialects to express themselves. This has complicated the task of SA, since most Arabic NLP tools have been developed for MSA.

Although research in Arabic SA is still in its early stages, it is increasing rapidly. As shown in Fig. 1, adapted from the work of [1], the number of scientific publications (conference papers and journal articles) has risen sharply over the last couple of years.

Fig. 1. Number of publications in Arabic SA in recent years

This growing interest demands formal and systematic reviews of the area. It is highly important for the scientific community to recognize the state of the art, understand existing methodologies and tools, and address challenges and open issues.

One of the main obstacles in Arabic SA is the scarcity of high-quality resources such as datasets, corpora, and lexicons. This paper reviews the main methods used to create them, their targeted dialects, and their sizes, in addition to their utilization by the reviewed SA approaches. The paper is organized as follows: Sect. 2 presents the survey methodology, Sect. 3 provides an overview of the lexical resources used by the reviewed Arabic SA (ASA) approaches, and Sect. 4 concludes the paper.

2 Survey Methodology

We followed the process of [1] in collecting the articles. The search was conducted using the keywords 'Arabic subjectivity and sentiment analysis', 'Arabic opinion mining', 'Comparative opinions Arabic', and 'Opinion spam Arabic' in the following databases: Google Scholar, Springer, IEEE Xplore, the ACM Digital Library, and Science Direct. The review covers papers published up to 2015. A total of 28 articles were selected from the retrieved publications; these included articles that introduced a new ASA resource and were not covered in [1]. The articles were then categorized as either an ASA approach or a resource, depending on their contributions. For the ASA resources, we included those used by the surveyed approaches, in addition to any resources not covered by previous surveys. The following sections review these resources, divided into lexicons and corpora/datasets. In each section, the articles are presented in tabulated form to ease readability. The aim is to provide a valuable reference for researchers considering ASA.

3 Resources

In this section we cover the linguistic resources essential to ASA approaches: sentiment lexicons and corpora.

3.1 Sentiment Lexicons

Here we review papers that reported the construction of a lexicon without presenting any new SA methods. Papers that both constructed a new lexicon and developed a new approach using it are listed in the summary table (Table 1) for reference. The proposed lexicons are marked as publicly available where applicable; otherwise, NA denotes that the lexicon is not available, and AOR that it is available on request.

Table 1. Lexica used in ASA techniques

In an attempt to produce an Arabic SentiWordNet (SWN), Al-Hazmi et al. [2] proposed a methodology for mapping SWN 3.0 to Arabic. However, this resource has limited coverage (10 K), was not tested in a sentiment analysis setting, and is not publicly available. Badaro et al. [3], however, present pioneering work in the same direction by constructing ArSenL, a large-scale Arabic sentiment lexicon. They relied on four resources to create ArSenL: English WordNet (EWN), Arabic WordNet (AWN), English SentiWordNet (ESWN), and SAMA (Standard Arabic Morphological Analyzer). Two approaches were followed, producing two different lexicons, each validated separately; the union of the two lexicons was then validated and produced the best performance. The first approach used AWN, mapping AWN entries into ESWN using existing offsets, thus producing ArSenL-AWN. The second approach utilized SAMA's English glosses, finding the highest-overlapping synsets between these glosses and ESWN, thus producing ArSenL-Eng. ArSenL is the union of these two lexicons. The authors evaluated the lexicon by comparing it to the SIFAAT lexicon [4]; it gave the highest coverage and the best performance in subjectivity and sentiment classification. Although this lexicon can be considered the largest Arabic sentiment lexicon developed to date, it unfortunately contains only MSA entries and no dialect words, and it was not developed from a social media context, which could affect accuracy when applied to social media text. Following the example of ArSenL, the lexicon SLSA (Sentiment Lexicon for Standard Arabic) [5] was constructed by linking the lexicon of the Arabic morphological analyzer Aramorph with SentiWordNet. Although the approach is very similar to ArSenL's, since both use SentiWordNet to obtain word scores, the authors argue that SLSA uses Aramorph, a free resource, whereas ArSenL uses SAMA, which is not free and thus makes ArSenL not publicly available.
The linking algorithm used to connect the glosses in Aramorph with those in SentiWordNet also differs. SLSA starts by linking each entry in Aramorph with SentiWordNet if the one-word gloss and POS match. To accommodate the unlinked entries, the POS match is then relaxed to include cases where the same lemma carries both noun and adjective POS tags; the next step ignores the POS completely. For multi-word glosses, the stop words are removed and the relaxed condition is tested on each word separately. This covers 98.2 % of the entries in Aramorph. Intrinsic and extrinsic evaluations comparing SLSA and ArSenL demonstrated the superiority of SLSA. Nevertheless, SLSA, like ArSenL, does not include dialect words and cannot accurately analyze social media text.
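The successively relaxed matching described above can be sketched as follows. This is a minimal illustration, not the SLSA implementation: the entries, glosses, and scores are toy stand-ins for Aramorph and SentiWordNet, and the exact relaxation order is simplified from the paper's description.

```python
# Toy sketch of SLSA-style relaxed gloss/POS linking (illustrative data only).

STOPWORDS = {"a", "an", "the", "of", "to"}

# Stand-in "Aramorph" entries: (lemma, POS, English gloss)
aramorph = [
    ("jamiyl", "adj", "beautiful"),
    ("Hazan", "noun", "sadness"),
    ("saEiyd", "adj", "happy person"),   # multi-word gloss
]

# Stand-in "SentiWordNet": (gloss word, POS) -> (pos_score, neg_score)
swn = {
    ("beautiful", "adj"): (0.75, 0.0),
    ("sadness", "noun"): (0.0, 0.625),
    ("happy", "adj"): (0.875, 0.0),
}

def link(lemma, pos, gloss):
    """Try successively relaxed matches: gloss+POS, then gloss only,
    then per-word matching for multi-word glosses (stop words removed)."""
    # Step 1: one-word gloss with matching POS.
    if (gloss, pos) in swn:
        return swn[(gloss, pos)]
    # Step 2: same gloss, any POS.
    for (g, _p), scores in swn.items():
        if g == gloss:
            return scores
    # Step 3: multi-word gloss -> test each non-stop word separately.
    for word in gloss.split():
        if word in STOPWORDS:
            continue
        for (g, _p), scores in swn.items():
            if g == word:
                return scores
    return None  # entry remains unlinked

linked = {lemma: link(lemma, pos, gloss) for lemma, pos, gloss in aramorph}
print(linked)  # "saEiyd" is linked via the word "happy" in its gloss
```

The step ordering matters: exact matches are consumed first, so the looser rules only fire for entries that would otherwise stay unlinked, which is how the paper reports reaching 98.2 % coverage.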

In [6] a bilingual sentiment lexicon was developed specifically for mining Dark Web forums. Two lexicons were built: SentiLEn for English and SentiLAr for Arabic. The Arabic lexicon was constructed by extracting sentiment words related to cyber threats, radicalism, and conflicts from 2000 message posts of the Alokab Web forum. Three Arabic language experts annotated the polarity of the extracted terms, giving each term a positive score in [0, 1] and a negative score in [0, 1]. If a word is always positive, its positive score is 1 and its negative score is 0; similarly, if a word is always negative, its negative score is 1 and its positive score is 0. For words used in both positive and negative contexts, positive and negative polarity scores are assigned in the range [0, 1] such that their sum is 1. Two additional scores are also given to each term for strong and hostile valences. The scores given by the three experts are then aggregated and normalized to lie in [-1, 1]. The paper only reported the construction of the lexicon; nothing was reported about validating it in a real application.
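A plausible reading of this aggregation scheme is averaging the three experts' scores and taking the positive-minus-negative difference, which lands naturally in [-1, 1] since each pair sums to 1. The paper only states that scores are aggregated and normalized, so the averaging and differencing below are assumptions for illustration.

```python
# Hedged sketch of aggregating per-expert (positive, negative) polarity
# scores into a single value in [-1, 1]; the averaging scheme is assumed.

def aggregate(expert_scores):
    """expert_scores: list of (pos, neg) pairs, each with pos + neg == 1.
    Returns a single polarity: +1 = always positive, -1 = always negative."""
    n = len(expert_scores)
    avg_pos = sum(p for p, _ in expert_scores) / n
    avg_neg = sum(q for _, q in expert_scores) / n
    return avg_pos - avg_neg

# Three annotators rating a mostly negative term (result is close to -0.8):
print(aggregate([(0.0, 1.0), (0.2, 0.8), (0.1, 0.9)]))
```

Because each expert's pair sums to 1, the difference of the averages is already bounded by [-1, 1], so no further rescaling is needed under this assumption.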

Starting from a small seed list of positive and negative words, Mahyoub et al. [7] used semi-supervised learning to propagate scores over the Arabic WordNet by exploiting synset relations. They used the same relations that [8] used in developing WordNet-Affect to expand the seed list. These comprise eight semantic/lexical relations: {near_synonym, verb_group, see_also_wn15, has_derived, related_to, has_subevent, causes, near_antonym}. The lexicon was evaluated on two corpora of movie and book reviews. Although it achieved high accuracy in evaluation, the lexicon still has low coverage (7576 words) and does not include dialect words.
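The general idea of propagating seed polarities over such relations can be illustrated with a small toy graph. This sketch is not the algorithm of [7]: the edges are invented, a simple one-pass breadth-first spread replaces their semi-supervised learning, and only near_antonym is assumed to flip polarity.

```python
# Toy sketch of seed-list propagation over WordNet-style relation edges.
from collections import deque

# (word, relation, word) edges; near_antonym is assumed to flip polarity.
edges = [
    ("good", "near_synonym", "fine"),
    ("fine", "has_derived", "finely"),
    ("good", "near_antonym", "bad"),
    ("bad", "near_synonym", "awful"),
]

def propagate(seeds):
    """seeds: {word: +1 or -1}. Spread labels along relation edges,
    treating relations as symmetric and labeling each word once."""
    labels = dict(seeds)
    graph = {}
    for a, rel, b in edges:
        graph.setdefault(a, []).append((rel, b))
        graph.setdefault(b, []).append((rel, a))
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for rel, nxt in graph.get(word, []):
            if nxt not in labels:
                labels[nxt] = -labels[word] if rel == "near_antonym" else labels[word]
                queue.append(nxt)
    return labels

# A single positive seed labels synonyms/derivations +1 and antonyms -1:
print(propagate({"good": 1}))
```

In this simplified form the seed "good" pulls "fine" and "finely" to +1, while the near_antonym edge pushes "bad" and, transitively, "awful" to -1.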

One of the challenges in sentiment analysis is handling phrases and idioms that convey sentiment. While sentiment words are significant clues for detecting sentiment in text, users tend to use common phrases and idioms to express their opinions. These phrases are made up of varying numbers of words that are usually not sentiment-bearing, and when treated separately by a sentiment analysis algorithm they would not be detected as sentiment clues. Consequently, some efforts have been initiated to address this challenge. The authors of [9] constructed an idioms/proverbs lexicon for the Egyptian dialect. They collected 32,785 idioms/proverbs from Arabic websites that present directories and encyclopedias of common Egyptian idioms and proverbs, then selected 3632 common phrases and manually annotated them for polarity (positive, negative). To check the coverage of this lexicon, they developed a technique to detect and extract phrases in text using similarity measures (cosine similarity and Levenshtein distance) combined with n-grams, reaching 98 % accuracy when applied to tweets and reviews.
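The detection step can be sketched by sliding n-grams over the text and fuzzy-matching them against the lexicon. This is an illustrative sketch in the spirit of [9], not their implementation: only Levenshtein distance is shown (their cosine measure is omitted), and the lexicon entries and threshold are invented, using an English idiom for readability.

```python
# Toy sketch of idiom detection via n-gram candidates + edit-distance matching.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_idioms(text, lexicon, threshold=0.8):
    """Slide n-grams over the text; report lexicon phrases whose
    normalized similarity to a candidate exceeds the threshold."""
    words = text.split()
    hits = []
    for phrase, polarity in lexicon.items():
        n = len(phrase.split())
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            sim = 1 - levenshtein(candidate, phrase) / max(len(candidate), len(phrase))
            if sim >= threshold:
                hits.append((phrase, polarity, sim))
    return hits

lexicon = {"break a leg": "positive"}
print(find_idioms("go break a leg tonight", lexicon))
```

Normalizing the edit distance by the longer string length gives a similarity in [0, 1], so near-matches caused by spelling variation (common in dialectal social media text) can still clear the threshold.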

The Arabic lexical semantics database (RDI-ArabSemanticDB) [10] was exploited in [11] to construct an Arabic sentiment lexicon. RDI-ArabSemanticDB contains approximately 150,000 Arabic words, 18,413 semantic fields, and 20 semantic relations, including synonymy, antonymy, hyponymy, and causality. These relations were used to expand a seed list of positive, negative, and neutral words. The lexicon was tested by first comparing it to a translated version of the MPQA lexicon and to a manually annotated subset of the lexicon; the results showed that translating an English lexicon does not give accurate results. The lexicon was also tested with different machine learning classifiers for Arabic sentiment, using a translated version of the MPQA corpus.

3.2 Corpora and Datasets

Applying sentiment analysis requires a corpus to train a classifier or to evaluate it. This section covers Arabic sentiment analysis research and the corpora it used. Most of the corpora were collected from social media, because the content is provided freely, easily, and instantaneously, and users can express, reach, and share opinions in public. Table 2 shows the main available corpora, in MSA or dialect, used in sentiment analysis.

Table 2. Corpora used in ASA techniques

The authors of [5] and [30] used the corpus of [31], which was based on [32], while the OCA corpus [33] was used by [13] and [34]. The authors of [28] used the HAAD corpus produced by [35], which reduced and reused the LABR corpus [36]. The authors of [37] used the corpus of [38], and [39] utilized the corpus created in [40].