
1 Introduction

Sentiment analysis, the process of identifying the opinion polarity of a piece of text, is used to analyze user-generated content from various web resources such as product reviews, movie reviews, and citizen opinions on public policies. It helps consumers research products or services before making a purchase decision, and helps organizations gather data on customer satisfaction and critical feedback to improve upon. There are two broad approaches to sentiment analysis: lexicon based and machine learning based. Lexicon-based approaches use natural language processing tools to extract sentiment words from the reviews and then find the overall polarity using sentiment lexicons such as SentiWordNet, SenticNet, and VADER (Valence Aware Dictionary for sEntiment Reasoning) [1]. To use a machine learning based approach, one has to first create a large number of example patterns by getting the positive or negative sentiments of real users on reviews of a specific domain; extract the features using natural language tools; compute the numeric value of the features using a mechanism such as the TF-IDF rating; train classifiers such as Naïve Bayes (NB) [2], Support Vector Machine (SVM) [3], or Maximum Entropy (ME) [4]; and finally use the trained classifiers to determine the sentiment polarity of unrated reviews.

Though the machine learning approach shows better performance in many cases [2, 5], it suffers from two problems: first, creating training patterns from real users is time consuming and expensive [5, 6]; second, selecting the right features and their numeric values is difficult [5, 7]. To deal with the first problem, many researchers advocate a cross-domain validation approach [7], where the training and testing patterns come from two unrelated datasets. To deal with the second problem, many methods have been proposed to create feature vectors, such as (1) using the TF-IDF rating associated with unigrams, bigrams, or, in general, n-grams, and (2) lexicon-based approaches where the sentiment score of the feature is used.

In the proposed work, we present a new approach called Lexical TFIDF for creating a feature vector. We construct senti n-grams by collecting the appropriate words from the review in consultation with a sentiment dictionary and by using the intensifiers and negations specified in [8]. This contrasts with earlier approaches, where the features can be any word or n-gram, not limited to sentiment words. Specifically, the scores of the sentiment lexicon, intensifiers, and negations are adopted from different sources [1, 8]. This score is then multiplied by the TF-IDF rating to determine the feature weight. Experiments with two benchmark data sets (IMDb (2004) and Epinion product reviews) and two classifiers (support vector machine and maximum entropy) show substantial improvement in accuracy and other performance measures for cross-domain validation when compared with existing methods such as Mudinas et al. [7] and Tripathy et al. [5].

The rest of the paper is organized as follows. The next section presents the literature review. In Sect. 3, the proposed approach is explained. Section 4 presents the experimental results. Section 5 concludes the paper and outlines future work.

2 Literature Survey

As discussed above, the selection of the right features and their scores is the key to improving the performance of machine learning based approaches. TF-IDF and count vectorizers are generally used as features for text classification [5]. A few researchers use lexicon-based approaches for feature extraction and decide the scores in combination with a count vectorizer [7]. Cross-domain validation ensures the applicability of a sentiment analysis approach to real-world data sets where training patterns are not available or are expensive to obtain. In this regard, many attempts have been made in the recent past. In the cross-domain learning problem, the training data set and the target data set come from different sources. For example, Mudinas et al. [7] used training data from the Browser (Customer) dataset and testing data from the Miscellaneous (Editor) dataset of CNET's software download website.

3 Proposed Approach

Before applying the proposed approach, the reviews undergo a few NLP preprocessing steps in which (1) stop words such as articles; (2) punctuation and symbols such as @, $, and %; and (3) hyperlinks and numbers are removed. Applying these steps to three example reviews leaves only the relevant words, as shown in Table 1.

Table 1. Reviews for preprocessing
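A minimal preprocessing sketch in Python with NLTK, under our assumptions: the standard English stop-word list, regex-based removal of hyperlinks and symbols, and explicit retention of negations and intensifiers (which Step 1 below needs); none of these implementation details are fixed by the paper.

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    # Keep negations and intensifiers: Step 1 of the approach needs them to
    # build senti n-grams, even though NLTK lists them as stop words.
    KEEP = {"not", "no", "never", "very"}

    def preprocess(review):
        """Remove hyperlinks, symbols, numbers, punctuation, and stop words."""
        text = re.sub(r"https?://\S+|www\.\S+", " ", review.lower())  # hyperlinks
        text = re.sub(r"[^a-z\s]", " ", text)  # symbols (@, $, %), digits, punctuation
        return [t for t in text.split() if t in KEEP or t not in STOP_WORDS]

    print(preprocess("It was awesome, feeling very happy! http://example.com"))
    # -> ['awesome', 'feeling', 'very', 'happy']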

Figure 1 is a pictorial representation of the proposed approach with six steps. A detailed explanation of these steps follows (an end-to-end code sketch of the pipeline appears after Step 6):

Fig. 1. Block diagram of Lexical TF-IDF

  • Step 1. Construct senti n-grams and build vectors for each review: When an intensifier (e.g., ‘very’) or a negation (e.g., ‘not’) appears before a semantic unigram (e.g., ‘happy’), we merge them and construct a bigram (e.g., ‘very happy’ or ‘not happy’). Similarly, we construct a trigram if two intensifiers or negations appear consecutively before a semantic unigram (e.g., ‘not very good’). The n-gram word vectors for the example reviews in Table 1 are as follows:

    Review 1: [“awesome”, “feeling”, “very happy”], Review 2: [“fine”, “not very good”], Review 3: [“not good”]

  • Step 2. Build the senti n-gram matrix for all the reviews: The n-gram word vectors obtained in Step 1 are used to construct a matrix (M) over all the reviews together, as shown below.

    $$ M = \left[ \begin{array}{ccc} \text{awesome} & \text{feeling} & \text{very happy}\\ \text{fine} & \text{not very good} & - \\ \text{not good} & - & - \end{array} \right] $$
  • Step 3. Construct the unique feature vector: The unique senti n-grams extracted from the above matrix are treated as the features. Continuing with our example, the feature vector for the review data set in Table 1 is as follows:

    Feature-Vector = [‘awesome’, ‘feeling’, ‘fine’, ‘not good’, ‘not very good’, ‘very happy’]

  • Step 4. Calculate the TF-IDF feature matrix: The TF-IDF feature matrix is generated using the above features as columns and the reviews as rows. The TF-IDF rating matrix for our problem is:

    $$ \text{TF-IDF} = \left[ \begin{array}{cccccc} 0.577 & 0.577 & 0 & 0 & 0 & 0.577\\ 0 & 0 & 0.707 & 0 & 0.707 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 \end{array} \right] $$
  • Step 5. Calculate the sentiment score of each feature: The sentiment score of each unigram feature is fetched from the VADER lexicon [1], and the scores of the other n-grams (bigrams and trigrams) are calculated using the SO-CAL (Semantic Orientation CALculator) approach [9].

    For example, the unigrams “awesome”, “feeling”, and “fine” have sentiment scores of 3.1, 0.5, and 0.8, respectively. For bigrams and trigrams, no well-established lexicon is available; the SO-CAL approach [9] is used to handle this situation, together with a list of intensifiers (amplifiers and downtoners) [8], each having an individual percentage score. For negation, a constant value of 4 is subtracted to shift the semantic word to its opposite polarity.

    As an example of the score calculation for the bigram “very happy”: suppose the sentiment score of “happy” is \(+2.7\) and the percentage score of the intensifier “very” is \(+25\%\). Then the score of “very happy” is \(+2.7 \times (100\%+25\%)= +3.375\). For negation, the score of “good” (\(+1.9\) in VADER) is shifted by the constant: “not good” scores \(+1.9 - 4 = -2.1\), and “not very good” (intensify first, then shift) scores \(+1.9 \times 125\% - 4 = -1.625\). For the example problem, the score vector (S) is as follows:

    $$ S = \left[ \begin{array}{ccccccc} \text{Features:} & \text{`awesome'} & \text{`feeling'} & \text{`fine'} & \text{`not good'} & \text{`not very good'} & \text{`very happy'}\\ \text{Score:} & 3.1 & 0.5 & 0.8 & -2.1 & -1.625 & +3.375 \end{array} \right] $$
  • Step 6. Construct the Lexical TFIDF feature matrix: Each feature column of the TF-IDF feature matrix is multiplied by the sentiment score of that feature to obtain the Lexical TFIDF matrix. The final Lexical TFIDF matrix for the example problem is:

    $$ \text{Lexical TFIDF} = \left[ \begin{array}{cccccc} 1.789 & 0.289 & 0 & 0 & 0 & 1.947\\ 0 & 0 & 0.566 & 0 & -1.149 & 0\\ 0 & 0 & 0 & -2.1 & 0 & 0 \end{array} \right] $$
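As referenced above, the following is an end-to-end sketch of Steps 1–6 on the running example. It assumes the VADER unigram scores listed in Step 5, an illustrative \(+25\%\) percentage for the intensifier ‘very’, and the negation shift of 4; the helper names (build_senti_ngrams, score) and the use of scikit-learn's TfidfVectorizer are our choices, not prescribed by the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer

    INTENSIFIERS = {"very": 0.25}   # percentage scores from [8] (illustrative subset)
    NEGATIONS = {"not"}
    NEG_SHIFT = 4.0                 # constant polarity shift for negation
    VADER = {"awesome": 3.1, "feeling": 0.5, "fine": 0.8,
             "happy": 2.7, "good": 1.9}   # unigram scores from the VADER lexicon [1]

    def build_senti_ngrams(tokens):
        """Step 1: merge preceding intensifiers/negations into senti n-grams."""
        ngrams, prefix = [], []
        for tok in tokens:
            if tok in INTENSIFIERS or tok in NEGATIONS:
                prefix.append(tok)            # defer until the semantic word arrives
            else:
                ngrams.append(" ".join(prefix + [tok]))
                prefix = []
        return ngrams

    def score(ngram):
        """Step 5: SO-CAL style scoring; intensify first, then shift for negation."""
        *mods, head = ngram.split()
        s = VADER.get(head, 0.0)
        for m in reversed(mods):              # apply the modifier closest to the head first
            if m in INTENSIFIERS:
                s *= 1.0 + INTENSIFIERS[m]
            elif m in NEGATIONS:
                s -= NEG_SHIFT
        return s

    reviews = [["awesome", "feeling", "very", "happy"],
               ["fine", "not", "very", "good"],
               ["not", "good"]]
    docs = [" | ".join(build_senti_ngrams(r)) for r in reviews]

    # Steps 2-4: TF-IDF over whole senti n-grams (each n-gram is one token).
    vec = TfidfVectorizer(tokenizer=lambda d: d.split(" | "), lowercase=False)
    tfidf = vec.fit_transform(docs).toarray()

    # Step 6: multiply each feature column by its sentiment score.
    lexical_tfidf = tfidf * [score(f) for f in vec.get_feature_names_out()]
    print(vec.get_feature_names_out())
    print(lexical_tfidf.round(3))   # reproduces the 3 x 6 matrix above

Because every senti n-gram here occurs in exactly one review, all idf values coincide and the L2-normalized rows come out as 0.577, 0.707, and 1, matching the TF-IDF matrix in Step 4.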

This matrix can be used as input for training supervised machine learning algorithms. Here, we use two algorithms: ME and SVM. In ME, the feature matrix of the training data is generally used to set constraints; the characteristics of the training data are then expressed by these constraints, which are used for testing [4]. The SVM method makes a decision by drawing an optimal hyperplane boundary between the two classes [3]. Many papers show that SVM and ME outperform other algorithms [2, 5]. The proposed feature selection approach, along with the above two classifiers, is compared with three existing methods in terms of four performance metrics: accuracy, precision, recall, and F1-score.
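A minimal training sketch, assuming the 3 × 6 Lexical TFIDF matrix computed above and binary polarity labels; following common practice, scikit-learn's LogisticRegression stands in for the ME classifier and LinearSVC for the SVM (the paper does not specify implementations).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # Stand-in for the Lexical TFIDF matrix from Step 6 (rows: reviews, cols: senti n-grams).
    X_train = np.array([[1.789, 0.289, 0.0, 0.0, 0.0, 1.947],
                        [0.0, 0.0, 0.566, 0.0, -1.149, 0.0],
                        [0.0, 0.0, 0.0, -2.1, 0.0, 0.0]])
    y_train = np.array([1, 0, 0])   # 1 = positive, 0 = negative polarity

    me_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # maximum entropy
    svm_clf = LinearSVC().fit(X_train, y_train)                        # linear SVM
    print(me_clf.predict(X_train), svm_clf.predict(X_train))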

Table 2. Performance evaluation for cross-domain (IMDb (2004) and Epinion) classification among different approaches

4 Experimental Results

We experiment with two real-world data sets and two classifiers, as discussed above. The data sets are IMDb (2004) and Epinion. IMDb (2004) is a polarity dataset consisting of 1000 positive and 1000 negative movie reviews [10], whereas Epinion is a collection of 400 reviews of 8 different products: cars, books, cookware, computers, movies, hotels, phones, and music. Each category contains 25 positive and 25 negative reviews [11]. For the experiments, we consider the reviews corresponding to books, cars, and computers. We use Python 3.5 with NLTK (for preprocessing) and Sklearn (for feature discovery and classification).
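A sketch of the cross-domain protocol under our assumptions: the classifier is fit on the source domain (e.g., IMDb (2004)) and evaluated on the unrelated target domain (e.g., Epinion), with no target labels used during training; the function name cross_domain_eval is hypothetical, and the metrics come from scikit-learn.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    def cross_domain_eval(clf, X_source, y_source, X_target, y_target):
        """Train on the source domain, test on the unrelated target domain."""
        clf.fit(X_source, y_source)
        pred = clf.predict(X_target)
        return {"accuracy": accuracy_score(y_target, pred),
                "precision": precision_score(y_target, pred),
                "recall": recall_score(y_target, pred),
                "f1": f1_score(y_target, pred)}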

Table 2 shows the comparison for cross-domain classification, where our method outperforms the other methods in 81.25% of the cases considering all the performance measures. Moreover, in all of the experiments, the proposed approach achieves the highest accuracy and precision using either ME or SVM.

5 Conclusion

In this work, we construct n-gram sentiment features by first extracting the sentiment words and their intensifiers from reviews. The scores corresponding to these features are obtained from existing sentiment lexicons. The proposed Lexical TFIDF matrix is constructed by multiplying the TF-IDF rating with the feature score. Experiments on two benchmark data sets with two well-known classifiers under cross-domain validation show that our approach outperforms existing methods in 81.25% of the cases considering all the performance measures; hence, it can be used for real data sets where example patterns are not available. In the future, we plan to improve upon the proposed method, mathematically analyze its robustness, and apply it to real case studies.