
1 Introduction

Sentiment analysis, the process of identifying the opinion polarity of a piece of text, is used to analyze user-generated content from various web resources such as product reviews, movie reviews, and citizen opinions on public policies. It helps consumers research products or services before making a purchase decision, and helps organizations gather data on customer satisfaction and critical feedback to improve upon. There are two broad approaches to sentiment analysis: lexicon based and machine learning based. Lexicon-based approaches use natural language processing tools to extract sentiment words from the reviews and then find the overall polarity using sentiment lexicons such as SentiWordNet, SenticNet, and VADER (Valence Aware Dictionary for sEntiment Reasoning) [1]. To use a machine learning based approach, one has to first create a large number of example patterns by getting the positive or negative sentiments of real users on reviews of a specific domain; extract the features using natural language tools; compute the numeric value of the features using a mechanism such as the TF-IDF rating; train classifiers such as Naïve Bayes (NB) [2], Support Vector Machine (SVM) [3], or Maximum Entropy (ME) [4]; and finally use the trained classifiers to determine the sentiment polarity of unrated reviews.

Though the machine learning approach shows better performance in many cases [2, 5], it suffers from two problems: first, creating training patterns from real users is time consuming and expensive [5, 6]; second, selecting the right features and their numeric values is difficult [5, 7]. To deal with the first problem, many researchers advocate a cross-domain validation approach [7], where the training and testing patterns come from two unrelated datasets. To deal with the second problem, many methods have been proposed to create feature vectors, such as (1) using the TF-IDF rating associated with unigrams, bigrams, or, in general, n-grams, and (2) lexicon-based approaches where the sentiment score of the feature is used.

In the proposed work, we present a new approach called Lexical TFIDF for creating a feature vector. We construct senti n-grams by collecting the appropriate words from the review in consultation with a sentiment dictionary and by using the intensifiers and negations specified in [8]. This contrasts with earlier approaches, where the features can be any word or n-gram, not limited to sentiment words. Specifically, the scores of the sentiment lexicon, intensifiers, and negations are adopted from different sources [1, 8]. This score is then multiplied by the TF-IDF rating to determine the feature weight. Experiments with two benchmark data sets (IMDb (2004) and Epinion product reviews) and two classifiers (support vector machine and maximum entropy) show substantial improvement in accuracy and other performance measures for cross-domain validation when compared with existing methods such as Mudinas et al. [7] and Tripathy et al. [5].

The rest of the paper is organized as follows. The next section presents the literature review. In Sect. 3, the proposed approach is explained. Section 4 presents the experimental results. Section 5 concludes the paper and outlines future work.

2 Literature Survey

As discussed above, the selection of the right features and their scores is the key to improving the performance of machine learning based approaches. TF-IDF and count vectorizers are generally used as features for text classification [5]. A few researchers use lexicon-based approaches for feature extraction and decide the scores in combination with a count vectorizer [7]. Cross-domain validation ensures the applicability of a sentiment analysis approach to real-world data sets where training patterns are not available or are expensive to obtain. In this regard, many attempts have been made in the recent past. In the cross-domain learning problem, the training data set and the target data set come from different sources. For example, Mudinas et al. [7] used training data from the Browser (Customer) dataset and testing data from the Miscellaneous (Editor) dataset of CNET's software download website.

3 Proposed Approach

Before applying the proposed approach, the reviews undergo a few NLP preprocessing steps in which (1) stop words such as articles; (2) punctuation and symbols such as @, $, and %; and (3) hyperlinks and numbers are removed. Applying these steps to three example reviews leaves only the relevant words, as shown in Table 1.

Table 1. Reviews for preprocessing
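A minimal preprocessing sketch in Python with NLTK, under our assumptions: the standard English stop-word list, regex-based removal of hyperlinks and symbols, and explicit retention of negations and intensifiers (which Step 1 below needs); none of these implementation details are fixed by the paper.

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    # Keep negations and intensifiers: Step 1 of the approach needs them to
    # build senti n-grams, even though NLTK lists them as stop words.
    KEEP = {"not", "no", "never", "very"}

    def preprocess(review):
        """Remove hyperlinks, symbols, numbers, punctuation, and stop words."""
        text = re.sub(r"https?://\S+|www\.\S+", " ", review.lower())  # hyperlinks
        text = re.sub(r"[^a-z\s]", " ", text)  # symbols (@, $, %), digits, punctuation
        return [t for t in text.split() if t in KEEP or t not in STOP_WORDS]

    print(preprocess("It was awesome, feeling very happy! http://example.com"))
    # -> ['awesome', 'feeling', 'very', 'happy']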

Figure 1 is a pictorial representation of the proposed approach with six steps. A detailed explanation of these steps follows (an end-to-end code sketch of the pipeline appears after Step 6):

Fig. 1. Block diagram of Lexical TF-IDF

  • Step 1. Construct senti n-grams and build vectors for each review: When an intensifier (e.g., ‘very’) or a negation (e.g., ‘not’) appears before a semantic unigram (e.g., ‘happy’), we merge them and construct a bigram (e.g., ‘very happy’ or ‘not happy’). Similarly, we construct a trigram if two intensifiers or negations appear consecutively before a semantic unigram (e.g., ‘not very good’). The n-gram word vectors for the example reviews in Table 1 are as follows:

    Review 1: [“awesome”, “feeling”, “very happy”], Review 2: [“fine”, “not very good”], Review 3: [“not good”]

  • Step 2. Build the senti n-gram matrix for all the reviews: The n-gram word vectors obtained in Step 1 are used to construct a matrix (M) over all the reviews together, as shown below.

    $$ M = \left[ \begin{array}{ccc} \text{awesome} & \text{feeling} & \text{very happy}\\ \text{fine} & \text{not very good} & - \\ \text{not good} & - & - \end{array} \right] $$
  • Step 3. Construct the unique feature vector: The unique senti n-grams extracted from the above matrix are treated as the features. Continuing with our example, the feature vector for the review data set in Table 1 is as follows:

    Feature-Vector = [‘awesome’, ‘feeling’, ‘fine’, ‘not good’, ‘not very good’, ‘very happy’]

  • Step 4. Calculate the TF-IDF feature matrix: The TF-IDF feature matrix is generated using the above features as columns and the reviews as rows. The TF-IDF rating matrix for our problem is:

    $$ \text{TF-IDF} = \left[ \begin{array}{cccccc} 0.577 & 0.577 & 0 & 0 & 0 & 0.577\\ 0 & 0 & 0.707 & 0 & 0.707 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 \end{array} \right] $$
  • Step 5. Calculate the sentiment score of each feature: The sentiment score of each unigram feature is fetched from the VADER lexicon [1], and the scores of the other n-grams (bigrams and trigrams) are calculated using the SO-CAL (Semantic Orientation CALculator) approach [9].

    For example, the unigrams “awesome”, “feeling”, and “fine” have sentiment scores of 3.1, 0.5, and 0.8, respectively. For bigrams and trigrams, no well-established lexicon is available; the SO-CAL approach [9] is used to handle this situation, together with a list of intensifiers (amplifiers and downtoners) [8], each having an individual percentage score. For negation, a constant value of 4 is subtracted to shift the semantic word to its opposite polarity.

    As an example of the score calculation for the bigram “very happy”: suppose the sentiment score of “happy” is \(+2.7\) and the percentage score of the intensifier “very” is \(+25\%\). Then the score of “very happy” is \(+2.7 \times (100\%+25\%)= +3.375\). For negation, the score of “good” (\(+1.9\) in VADER) is shifted by the constant: “not good” scores \(+1.9 - 4 = -2.1\), and “not very good” (intensify first, then shift) scores \(+1.9 \times 125\% - 4 = -1.625\). For the example problem, the score vector (S) is as follows:

    $$ S = \left[ \begin{array}{ccccccc} \text{Features:} & \text{`awesome'} & \text{`feeling'} & \text{`fine'} & \text{`not good'} & \text{`not very good'} & \text{`very happy'}\\ \text{Score:} & 3.1 & 0.5 & 0.8 & -2.1 & -1.625 & +3.375 \end{array} \right] $$
  • Step 6. Construct the Lexical TFIDF feature matrix: Each feature column of the TF-IDF feature matrix is multiplied by the sentiment score of that feature to obtain the Lexical TFIDF matrix. The final Lexical TFIDF matrix for the example problem is:

    $$ \text{Lexical TFIDF} = \left[ \begin{array}{cccccc} 1.789 & 0.289 & 0 & 0 & 0 & 1.947\\ 0 & 0 & 0.566 & 0 & -1.149 & 0\\ 0 & 0 & 0 & -2.1 & 0 & 0 \end{array} \right] $$
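As referenced above, the following is an end-to-end sketch of Steps 1–6 on the running example. It assumes the VADER unigram scores listed in Step 5, an illustrative \(+25\%\) percentage for the intensifier ‘very’, and the negation shift of 4; the helper names (build_senti_ngrams, score) and the use of scikit-learn's TfidfVectorizer are our choices, not prescribed by the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer

    INTENSIFIERS = {"very": 0.25}   # percentage scores from [8] (illustrative subset)
    NEGATIONS = {"not"}
    NEG_SHIFT = 4.0                 # constant polarity shift for negation
    VADER = {"awesome": 3.1, "feeling": 0.5, "fine": 0.8,
             "happy": 2.7, "good": 1.9}   # unigram scores from the VADER lexicon [1]

    def build_senti_ngrams(tokens):
        """Step 1: merge preceding intensifiers/negations into senti n-grams."""
        ngrams, prefix = [], []
        for tok in tokens:
            if tok in INTENSIFIERS or tok in NEGATIONS:
                prefix.append(tok)            # defer until the semantic word arrives
            else:
                ngrams.append(" ".join(prefix + [tok]))
                prefix = []
        return ngrams

    def score(ngram):
        """Step 5: SO-CAL style scoring; intensify first, then shift for negation."""
        *mods, head = ngram.split()
        s = VADER.get(head, 0.0)
        for m in reversed(mods):              # apply the modifier closest to the head first
            if m in INTENSIFIERS:
                s *= 1.0 + INTENSIFIERS[m]
            elif m in NEGATIONS:
                s -= NEG_SHIFT
        return s

    reviews = [["awesome", "feeling", "very", "happy"],
               ["fine", "not", "very", "good"],
               ["not", "good"]]
    docs = [" | ".join(build_senti_ngrams(r)) for r in reviews]

    # Steps 2-4: TF-IDF over whole senti n-grams (each n-gram is one token).
    vec = TfidfVectorizer(tokenizer=lambda d: d.split(" | "), lowercase=False)
    tfidf = vec.fit_transform(docs).toarray()

    # Step 6: multiply each feature column by its sentiment score.
    lexical_tfidf = tfidf * [score(f) for f in vec.get_feature_names_out()]
    print(vec.get_feature_names_out())
    print(lexical_tfidf.round(3))   # reproduces the 3 x 6 matrix above

Because every senti n-gram here occurs in exactly one review, all idf values coincide and the L2-normalized rows come out as 0.577, 0.707, and 1, matching the TF-IDF matrix in Step 4.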

This matrix can be used as input for training supervised machine learning algorithms. Here, we use two algorithms: ME and SVM. In ME, the feature matrix of the training data is generally used to set constraints; the characteristics of the training data are then expressed by these constraints, which are used for testing [4]. The SVM method makes a decision by drawing an optimal hyperplane boundary between the two classes [3]. Many papers show that SVM and ME outperform other algorithms [2, 5]. The proposed feature selection approach, along with the above two classifiers, is compared with three existing methods in terms of four performance metrics: accuracy, precision, recall, and F1-score.
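A minimal training sketch, assuming the 3 × 6 Lexical TFIDF matrix computed above and binary polarity labels; following common practice, scikit-learn's LogisticRegression stands in for the ME classifier and LinearSVC for the SVM (the paper does not specify implementations).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # Stand-in for the Lexical TFIDF matrix from Step 6 (rows: reviews, cols: senti n-grams).
    X_train = np.array([[1.789, 0.289, 0.0, 0.0, 0.0, 1.947],
                        [0.0, 0.0, 0.566, 0.0, -1.149, 0.0],
                        [0.0, 0.0, 0.0, -2.1, 0.0, 0.0]])
    y_train = np.array([1, 0, 0])   # 1 = positive, 0 = negative polarity

    me_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # maximum entropy
    svm_clf = LinearSVC().fit(X_train, y_train)                        # linear SVM
    print(me_clf.predict(X_train), svm_clf.predict(X_train))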

Table 2. Performance evaluation for cross-domain (IMDb (2004) and Epinion) classification among different approaches

4 Experimental Results

We experiment with two real-world data sets and two classifiers, as discussed above. The data sets are IMDb (2004) and Epinion. IMDb (2004) is a polarity dataset consisting of 1000 positive and 1000 negative movie reviews [10], whereas Epinion is a collection of 400 reviews of 8 different products: cars, books, cookware, computers, movies, hotels, phones, and music. Each category contains 25 positive and 25 negative reviews [11]. For the experiments, we consider the reviews corresponding to books, cars, and computers. We use Python 3.5 with NLTK (for preprocessing) and Sklearn (for feature discovery and classification).
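A sketch of the cross-domain protocol under our assumptions: the classifier is fit on the source domain (e.g., IMDb (2004)) and evaluated on the unrelated target domain (e.g., Epinion), with no target labels used during training; the function name cross_domain_eval is hypothetical, and the metrics come from scikit-learn.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    def cross_domain_eval(clf, X_source, y_source, X_target, y_target):
        """Train on the source domain, test on the unrelated target domain."""
        clf.fit(X_source, y_source)
        pred = clf.predict(X_target)
        return {"accuracy": accuracy_score(y_target, pred),
                "precision": precision_score(y_target, pred),
                "recall": recall_score(y_target, pred),
                "f1": f1_score(y_target, pred)}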

Table 2 shows the comparison for cross-domain classification, where our method outperforms the other methods in 81.25% of the cases considering all the performance measures. Moreover, in all of the experiments, the proposed approach achieves the highest accuracy and precision using either ME or SVM.

5 Conclusion

In this work, we construct n-gram sentiment features by first extracting the sentiment words and their intensifiers from reviews. The scores corresponding to these features are obtained from existing sentiment lexicons. The proposed Lexical TFIDF matrix is constructed by multiplying the TF-IDF rating with the feature score. Experiments on two benchmark data sets with two well-known classifiers under cross-domain validation show that our approach outperforms existing methods in 81.25% of the cases considering all the performance measures; hence, it can be used for real data sets where example patterns are not available. In the future, we plan to improve upon the proposed method, mathematically analyze its robustness, and apply it to real case studies.