1 Introduction

The World Wide Web has emerged as the largest repository of user-generated text containing opinions. Suggestions from many people are available on the Internet, and Internet users now consult forums, news blogs, discussion groups, or review sites for opinions and suggestions even when making the smallest decision, e.g., buying a routine device [1]. Opinions can be defined as subjective expressions that describe people’s feelings [2], sentiments, or appraisals of objects, procedures, events, and their characteristics.

Sites, forums, and discussion groups that gather opinions hold large volumes of data, which makes it difficult for a person to search for relevant opinions manually. Moreover, survey companies may need such data to carry out research on a product, person, or political party. Hiring individuals for this job would be costly and time consuming. Therefore, a system is needed that automatically mines such data to retrieve relevant opinions or suggestions about a specific subject.

Several approaches to opinion mining exist, including supervised, unsupervised, and lexicon-based approaches. These techniques have been used to mine opinions in languages such as English, Persian, Hindi, and Turkish, but none has been applied to Urdu. Our proposed system focuses on Urdu because it is a major language, spoken and understood around the globe, with 80 million speakers in the subcontinent [3]. Its complex morphology and orthography pose challenges that have to be overcome in Urdu language processing.

Many forums and social media sites are localizing their content by allowing users to comment and chat in their native language. Such forums are growing rapidly for Urdu as well, letting people post their suggestions in Urdu. We therefore propose an approach for Urdu that analyzes such data and highlights positive and negative opinions. In this work, a lexicon-based approach has been implemented.

Section 2 reviews related work in opinion mining, Sect. 3 briefly describes the implementation of the two-step lexicon-based opinion mining model for Urdu (LOMMU), Sect. 4 discusses the evaluation results of LOMMU, and Sect. 5 concludes the paper with some future directions.

2 Literature Survey

Researchers have proposed many different approaches to opinion mining, including supervised, unsupervised, and semi-supervised learning methods. Unsupervised learning methods have been increasingly successful in recent NLP research, mainly because they take unlabeled data as input. Moreover, unsupervised learning leads to a better understanding of modeling methods, optimization of algorithms, and conversion of domain knowledge into structured models. Sentiment analysis also uses the lexicon-based approach together with unsupervised learning. A sentiment lexicon can be constructed in three ways: manually, from a dictionary, or from a corpus.

The Naïve Bayes algorithm is the most widely used supervised classification model [4]. It estimates the probability that an opinion is positive or negative using the joint probabilities of a set of words in a given category. The Support Vector Machine (SVM) is a non-probabilistic binary classification method proposed by Vladimir Vapnik. It looks for the hyperplane with the maximum margin between the positive and negative examples among the training opinions. K-Nearest Neighbor (KNN) classification, in turn, is based on the assumption that the class of an instance is most similar to the classes of other instances that are nearby in the vector space. In comparison to other classification methods such as Naïve Bayes, KNN does not rely on prior probabilities and is computationally efficient [5]. The Naïve Bayes, SVM, and KNN classifiers discussed above are all supervised learning methods and have been used for English-language opinion mining. Another proposed technique performs classification based on fixed syntactic patterns that are likely to be used to express opinions [6]. Lexicon-based approaches have also been applied to English. Comprehensive lexicons have been constructed for English, such as SentiWordNet 3.0, which is publicly available and is used by different researchers for opinion extraction [7].
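For reference, the Naïve Bayes decision rule described above is commonly written as follows. This is the standard textbook form, not a formula taken from [4]; the symbols are introduced here for illustration only, with c denoting a class, w_1, …, w_n the words of an opinion, and \hat{c} the predicted class:

$$ \hat{c} = \mathop{\arg\max}\limits_{c \in \{\text{positive},\,\text{negative}\}} P(c)\prod_{i=1}^{n} P(w_{i} \mid c) $$

The product over words reflects the "naïve" assumption that words are conditionally independent given the class.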

Cross-domain sentiment analysis has been attempted by many researchers. The authors of [8] experimented with German emails: the emails are converted to English for calculating sentiment orientations, after which they are converted back to German. The precision of this system is satisfactory, but its recall is reported to be quite poor. In [9], a slightly different problem is addressed using a maximum entropy-based EM algorithm, which jointly learns two monolingual sentiment classifiers by treating the sentiment labels in the unlabeled parallel text as unobserved latent variables.

Urdu is a morphologically complex language with a distinct writing style, which makes applying cross-domain sentiment analysis techniques to Urdu opinion mining quite difficult. Moreover, the Urdu data available online is unlabeled, and little of it is available for analysis. Consequently, little corpus-based work on lexicon implementation has been done. One of the most comprehensive Urdu lexica is available at http://www.cle.org.pk [3]. Its data is XML based and, as per the annotation schema, contains about 20 etymological, phonetic, morphological, syntactic, semantic, and other parameters of information about each word. Another lexicon, proposed in [10], was constructed from an Urdu news corpus of 1.5 million words. The corpus was tokenized on spaces and punctuation marks, keeping the diacritics. The extracted lexicon contains 9,126 total words and 4,816 unique words. These lexicons do not contain enough data for decision making. Moreover, their accuracy is not as good as their authors describe.

Urdu lexicon development involves decisions regarding part-of-speech (POS) tags and their respective features, lemmas, transcription, and lexicon format. A POS tagger used for Urdu lexicon development tags each sentence with categories such as noun, verb, adjective, adverb, numeral, postposition, conjunction, pronoun, auxiliary, case marker, and harf. Most ongoing work uses XML-based lexicon formats [11, 12]. Constructing such lexicons is time consuming because each entry has detailed information attached to it.

An alternative to the XML-based format is a two-step lexicon approach based on JavaScript Object Notation (JSON), which is easy to implement and less time consuming. A JSON lexicon consists of keys, each of which has corresponding values associated with it. The proposed lexicon-based opinion mining model for Urdu (LOMMU), which uses the JSON format, is described in the next section.

3 Lexicon-Based Opinion Mining Model for Urdu Language (LOMMU)

Developing a lexicon is a critical step in opinion mining. LOMMU can be implemented on any operating system (OS), but our work has been tested on Macintosh OS. It uses an algorithmic approach to develop a two-step lexicon. The JSON-based lexicon structure is shown in Fig. 1.

Fig. 1. JSON format-based lexicon structure

The JSON format shown in Fig. 1 gives detailed information about the Urdu word transliterated as Mardon (“men”). It contains the gender, number, phonetics, case, and lemma of the candidate word.
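As an illustration of how such an entry might be represented and loaded, the sketch below builds one record with the fields named above (gender, number, phonetics, case, lemma). The key names, the romanized spellings, and the field values are assumptions made for illustration; they are not copied from Fig. 1, and the actual lexicon stores words in Urdu script.

```python
import json

# Hypothetical lexicon entry. Key names and values are illustrative
# assumptions, not the schema shown in Fig. 1.
entry = {
    "word": "mardon",        # romanized stand-in for the Urdu-script form
    "gloss": "men",
    "gender": "masculine",
    "number": "plural",
    "phonetics": "mardon",   # placeholder transcription
    "case": "oblique",
    "lemma": "mard",
}


def load_lexicon(json_text):
    """Parse a JSON array of entries into a dict keyed by surface word."""
    return {item["word"]: item for item in json.loads(json_text)}


# Serialize one entry the way a lexicon file might store it, then reload it.
lexicon_json = json.dumps([entry], ensure_ascii=False, indent=2)
lexicon = load_lexicon(lexicon_json)
print(lexicon["mardon"]["lemma"])
```

Keying the loaded lexicon by surface word keeps each lookup a single dictionary access, which is one reason a flat key-value format is convenient for this kind of matching.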

The raw corpus is annotated using a POS tagger, and adji-units are then extracted from the given text. All decisions made during opinion mining are based on these adji-units. Negations are also handled in our system by treating them as polarity shifters. LOMMU uses a two-step lexicon consisting of positive and negative lexemes. The adji-units extracted from the text under consideration are compared with these lexemes, and when a negation is attached to an adji-unit, the polarity of the sentence shifts. An overview of the system is given in Fig. 2.

Fig. 2. Two-step lexicon-based opinion mining model for Urdu language

Here, we define the problem of Urdu opinion mining. Let “O” be an Urdu text consisting of sentences, each of which is either factual or opinionated. “O” is then the union of the factual and the opinionated sentences:

$$ O = \left\{ \text{set of factual sentences} \right\} \cup \left\{ \text{set of opinionated sentences} \right\} $$

LOMMU differentiates opinionated sentences from factual ones because of the significance of opinionated sentences in opinion mining. The main tasks of LOMMU can be described as follows:

  • Convert Gathered Data into UTF-16 Format for Processing: The gathered data arrives in different forms and therefore cannot be processed as is. It is converted to UTF-16 format for further processing.

  • Normalization of Data: The converted data is then normalized by removing dots, punctuation marks, commas, dashes or any irrelevant symbols. This step can be referred to as a preprocessing step.

  • Tokenization: Tokenization is the process of extracting each word from a sentence. Tokens are simply separated by spaces and may not be complete, meaningful words. Some examples of tokens for the following sentence are given below:

  • Segmentation: Segmentation is the process of extracting meaningful words. As noted in the previous step, some tokens are not meaningful words; segmentation is therefore needed to obtain complete, meaningful word segments. Examples of segments for the same sentence are shown below:

  • Shallow parsing: After the corpus is annotated, adji-units are extracted for opinion mining using shallow parsing. Any POS tagger, e.g., the CRULP POS tagger, can be used to annotate the corpus. Phrase-level negations are also handled as part of shallow parsing. Examples of shallow parsing are given below with reference to the sentence given above.

  • Adji-units Analysis: Adji-units are then compared with the positive and negative lexemes in the lexicon. For this reason, the model is known as the two-step lexicon-based opinion mining model for Urdu language. The presence of the word meaning “very” in a sentence enhances the intensity of that sentence (either positively or negatively). The overall polarity of the sentence is then calculated. Adji-units that match neither positive nor negative entries in the lexicon are entered manually for efficient processing.
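The tasks listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the LOMMU implementation: the word lists are tiny romanized placeholders rather than Urdu-script lexemes, POS tagging and segmentation are omitted so tokens are looked up in the lexicon directly, the intensifier is assumed to double the weight of the adji-unit that follows it, and a negation is assumed to appear directly after the adji-unit it shifts.

```python
import re

# Placeholder word lists (assumptions for illustration only).
POSITIVE = {"acha", "behtareen"}
NEGATIVE = {"bura", "kharab"}
NEGATIONS = {"nahi"}       # polarity shifters
INTENSIFIERS = {"bohat"}   # "very"


def to_text(raw_bytes):
    """Decode gathered bytes that were stored as UTF-16."""
    return raw_bytes.decode("utf-16")


def normalize(text):
    """Replace punctuation and other symbols with spaces."""
    return re.sub(r"[^\w\s]", " ", text)


def tokenize(sentence):
    """Split the normalized sentence on whitespace."""
    return normalize(sentence).split()


def sentence_polarity(sentence):
    """Return a signed score: > 0 positive, < 0 negative, 0 no opinion found."""
    tokens = tokenize(sentence)
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            value = 1
        elif tok in NEGATIVE:
            value = -1
        else:
            continue
        # Assumed rule: an intensifier right before the word doubles its weight.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            value *= 2
        # Assumed rule: a negation right after the word flips its polarity.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS:
            value = -value
        score += value
    return score


raw = "yeh laptop bohat acha hai.".encode("utf-16")  # stand-in for gathered data
print(sentence_polarity(to_text(raw)))
```

A sentence whose score is zero contains no matched lexeme and can be treated as factual in this sketch; a real system would instead rely on the tagger output to decide which words are adji-units before consulting the lexicon.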

4 Evaluation of LOMMU

LOMMU has been evaluated using sample text files consisting of sentences and 10,000 tagged words downloaded from http://www.cle.org.pk. First, a tag-set is selected to extract adji-units. Second, the extracted adjectives are compared with the positive and negative lexemes in the lexicon. Finally, the results are discussed along with the system accuracy. Figures 3 and 4 show the complete working of LOMMU.

Fig. 3. Read Urdu tagged file, tag-set selection, and adji-units extraction

Fig. 4. Lexicon having negative and positive lexemes, and extracted adji-units

The developed LOMMU reads a text file whose polarity is to be calculated. It extracts the list of adji-units from the tagged words by retaining the words tagged <JJ>. Negations and other factors that may increase or decrease polarity are stored at the backend for use when calculating the final results. The extracted adji-units are then compared with the entries in the lexicon, and a sentence-by-sentence analysis is conducted to reach the overall decision.
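The tag-based extraction step described above might look like the following sketch. The layout of the tagged file is an assumption: each word is taken to be immediately followed by its tag in angle brackets (for example `acha<JJ>`), and every tag name other than JJ, as well as the romanized sample words, is hypothetical. The real tagged files may be laid out differently.

```python
import re

# Assumed layout: a word immediately followed by its tag in angle brackets.
TAGGED_TOKEN = re.compile(r"(\S+?)<([A-Z]+)>")


def extract_adji_units(tagged_text, tag="JJ"):
    """Return the words whose POS tag equals `tag`, in document order."""
    return [word for word, found in TAGGED_TOKEN.findall(tagged_text) if found == tag]


# Hypothetical tagged sentence in romanized form.
sample = "laptop<NN> bohat<ADV> acha<JJ> hai<VB> magar<CC> battery<NN> kharab<JJ> hai<VB>"
print(extract_adji_units(sample))
```

Keeping the tag as a parameter makes it straightforward to experiment with a wider tag-set than <JJ> alone, which corresponds to the tag-set selection step shown in Fig. 3.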

Figure 3 shows adji-units extracted from the given file and results given by LOMMU.

LOMMU has been tested on test data, and the results obtained are satisfactory. Some sample texts are discussed here along with the results our system gives for them.

Sample 1 contains reviews of a laptop taken from a discussion forum. More than 400 words have been condensed to around 160 words forming a complete paragraph. This data has then been tagged using an existing POS tagger. LOMMU reads the data word by word and extracts adji-units, shown as underlined words in the text. Negations attached to any extracted adji-unit are stored at the backend for final decision making.

When this text is passed through our system, it matches 4 positive lexemes and 2 negative lexemes but skips the others. Positive lexemes are shown as plain underlined words, whereas negative lexemes are shown as dotted underlined words. Skipped lexemes can be added manually to the existing lexicon list for future use. The overall results for Sample 1 are given in Table 1 below. For the text in Sample 1, the accuracy of the LOMMU lexicon is 60%.

Table 1. Sample 1 results

Sample 2 contains reviews of an employee of a software firm. Here, 350 words have been reduced to 250 words forming a complete paragraph. This sample data has also been tagged using a POS tagger. The extracted adji-units are shown as underlined words in the text above. Plain underlined words are positive lexemes, whereas dotted underlined words are negative lexemes.

Sample 2, when passed through our system, matches 8 positive lexemes and 2 negative lexemes. In this case, the overall accuracy of the LOMMU lexicon is recorded as 55.55%, as given in Table 2 below.

Table 2. Sample 2 results
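The text does not define how accuracy is computed. One reading that is consistent with both reported figures is the share of extracted adji-units that were matched in the lexicon. The sketch below works through that reading; the totals of 10 and 18 extracted adji-units are assumptions chosen to agree with the reported percentages and are not stated in the text, and the function name is hypothetical.

```python
def lexicon_accuracy(matched_positive, matched_negative, total_adji_units):
    """Percentage of extracted adji-units that were found in the lexicon."""
    if total_adji_units <= 0:
        raise ValueError("total_adji_units must be positive")
    matched = matched_positive + matched_negative
    return 100.0 * matched / total_adji_units


# Assumed totals (10 and 18) are not given in the text.
sample_1 = lexicon_accuracy(4, 2, 10)   # 6 of 10 matched
sample_2 = lexicon_accuracy(8, 2, 18)   # 10 of 18 matched
print(sample_1, round(sample_2, 2))
```

Under these assumed totals, Sample 1 gives 60% and Sample 2 gives 10/18, which is about 55.56% and corresponds to the reported 55.55% when truncated rather than rounded.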

The full experiment has been conducted on different corpora totalling around 100,000 words. It yields a lower accuracy of 50–52%. The main reason for this decline is that our lexicon is not yet mature and contains only 15,000 words (adji-units). Adji-units can be added manually for efficient processing, and increasing the number of adji-units in the lexicon is expected to increase the accuracy of LOMMU.

LOMMU presents a sentiment-annotated lexicon for mining positive and negative opinionated expressions in any given Urdu text, and thus forms an integral basis for Urdu text-based sentiment analysis. It achieves an accuracy of about 50–52% with just 15,000 adji-units in the lexicon; enlarging the lexicon further is expected to increase this accuracy.

Moreover, all the existing sentiment analysis systems are for the Windows platform. Our system, on the other hand, provides a platform for Macintosh users.

5 Conclusion and Future Work

A two-step lexicon-based opinion mining model has been proposed for the Urdu language, using a JSON-based approach to construct the lexicon. Detailed information is given for each word in the lexicon. The model has been tested on different corpora totalling about 100,000 words. The system gives an accuracy of about 50–52%, as our lexicon consists of only 15,000 words.

Future work will enhance the developed lexicon by adding more words so that higher system accuracy can be achieved.