
1 Introduction

The opinions of others have a great impact on most of us when we make a decision. With the availability of the World Wide Web, we can now easily learn people's opinions about anything we want to know, including the opinions and experiences of people who are completely unknown to us. This paper focuses on classifying opinions from text with web-based diverse data. Performing opinion mining on Bangla data is a challenging task because the available corpora are very small. The Internet has become a convenient medium for people to express their feelings, emotions, attitudes and opinions. Automatic opinion mining is very useful for applications such as news reviews, blog reviews, stock market reviews, movie reviews, travel advice, social issue discussions, and consumer complaints. Opinion mining is also of great interest to social networking media such as Facebook, Twitter, Instagram and Google+, which can use mined opinions to detect unanticipated posts, shares and comments; this, however, involves analyzing many languages. Our work begins with resource generation, that is, building an annotated dataset from the Internet. The datasets we used are in English, collected from the web, specifically from Amazon product reviews and Twitter. We translate these to generate the Bangla corpus used in this research. We use a Support Vector Machine (SVM) for the classification task. SVM is not necessarily better than other machine learning methods, but it performs at the state-of-the-art level and has considerable current theoretical and empirical appeal. Related experimental results show that SVM tends to attain significantly higher accuracy than traditional mining schemes. We also propose a method to estimate the polarity strength of an opinion, labeling the polarity of each opinion as weak, steady or strong.
The accuracy of our classifier on Bangla data shows that our generated Bangla corpus works well for classifying an unknown Bangla opinion with promising accuracy.

2 Related Works

Opinion mining from Bangla text is still at an exploratory stage. Some research has been conducted on sentiment detection, opinion mining and polarity classification of Bangla sentences. Contextual valency analysis has been used to detect sentiment in Bangla text; this work uses SentiWordNet for the predefined polarity of each word [1]. Another work describes a sentiment analyzer that constructs phrase patterns and measures sentiment orientation using prior patterns [2]. Opinions or sentiments can be expressed as emotions or feelings, or as opinions, ideas or judgments colored by emotions [3]. Opinion mining deals with the computational treatment of opinion or sentiment in text [4]. Several supervised learning methods for classifying Bangla text based on these ideas have been proposed in the past decade. One classifier determines the opinion expressed in both English and Bangla using Naive Bayes, where the strength of each opinion polarity is estimated by probability [5]. The SVM classification algorithm outperforms the others, and good results can be achieved using unigrams as features with binary presence/absence values rather than term frequency, unlike what usually happens in topic-based categorization [6]. A hybrid system has been proposed to classify overall opinion polarity in the less-resourced Bangla language using linguistic syntactic features [7]. A hybrid approach combining Support Vector Machine and Particle Swarm Optimization has been used to mine opinions from movie reviews and was effective in increasing accuracy [8].

3 Resource Acquisition

Most opinion mining research efforts in the last decade deal with English text, with little work on Bangla text. As Bangla is a computationally under-resourced language, building a Bangla corpus is a demanding task. The construction of a large corpus for the Bangla language has historically been difficult. Although Bangla text from blogs, newspapers, Twitter, Facebook and many other online sources is available nowadays, it remains crucial to collect a usable set of text, carefully balanced in opinion. Our Bangla corpus is generated from two widely used English datasets. We used Amazon's online watch reviews [9], with 8,000 opinions for each of the positive and negative polarities; neutral reviews with rating 3 were ignored. Each review has a strong polarity and contains up to 10,000 characters. We also used a corpus of tweets already classified by sentiment, based on the Twitter Sentiment Corpus collected from Sentiment140 [10]. The Twitter Sentiment Analysis Dataset contains 600,000 classified tweets, all labeled by emoticon: for example, tweets containing ':)' and ':(' are classified as positive and negative respectively.
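The emoticon-based labeling used in the Sentiment140-style corpus can be sketched as follows. This is a minimal illustration only; the function name and the exact rules are ours, not taken from the original corpus tooling.

```python
def label_by_emoticon(tweet):
    """Assign a polarity label from emoticons (distant supervision).

    Returns 'positive', 'negative', or None when no known emoticon
    is found, in which case the tweet is left unlabeled.
    """
    if ':)' in tweet:
        return 'positive'
    if ':(' in tweet:
        return 'negative'
    return None


print(label_by_emoticon('great watch, love it :)'))  # positive
```

Tweets containing no emoticon are simply excluded from the training set under this scheme.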

We derive our Bangla corpus from these English datasets using Google Translate in two different ways [5]. In the first, a tool translates each opinion as a whole. The result may not convey the exact meaning, but it retains enough information to preserve polarity. The second method is dictionary-based translation: a dictionary is built from all the words in the English dataset, and each opinion is converted to Bangla word by word. This method also removes noise from the Bangla data.
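The dictionary-based translation can be sketched as a word-by-word lookup. The dictionary contents below are illustrative placeholders (romanized stand-ins for Bangla words); the real system builds its dictionary from the full English vocabulary of the dataset.

```python
def translate_word_by_word(sentence, dictionary):
    """Translate a sentence word by word using a fixed dictionary.

    Words missing from the dictionary are dropped, which is also how
    this method filters noise out of the generated Bangla data.
    """
    words = sentence.lower().split()
    translated = [dictionary[w] for w in words if w in dictionary]
    return ' '.join(translated)


# Toy English-to-Bangla dictionary (romanized placeholder strings)
toy_dict = {'good': 'bhalo', 'watch': 'ghori', 'very': 'khub'}
print(translate_word_by_word('Very good watch', toy_dict))  # khub bhalo ghori
```

Out-of-vocabulary tokens (URLs, usernames, misspellings) vanish from the output, so no separate cleaning pass is needed for this translation path.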

Table 1. Effects of negation (BANGLA)

Negation plays an influential role in natural language: it inverts the polarity of a sentence. In Bangla, negation is marked by specific words (e.g., ). As most of the generated sentences contain only ‘’, ‘’ and ‘’, we consider these three as negation words for Bangla data. In some cases a generated Bangla sentence contains no negation word even though it should. Since detecting the scope of negation in Bangla text is problematic, we treat every word in a negated sentence as affected and create a new feature accordingly, as shown in Table 1. The scope of negation cannot be properly modeled with this representation.
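This whole-sentence treatment of negation can be sketched as follows. Marking tokens with a `NOT_` prefix is a common feature-engineering convention we assume here for illustration, and the negation words are romanized placeholders for the three Bangla negators mentioned above.

```python
# Romanized placeholders for the three Bangla negation words
NEGATION_WORDS = {'na', 'nei', 'noy'}


def add_negation_feature(tokens):
    """Mark every token of a sentence that contains a negation word.

    The scope of negation is not modeled: if any negator appears,
    all remaining words are treated as negated, mirroring the coarse
    representation described in the text.
    """
    if any(t in NEGATION_WORDS for t in tokens):
        return ['NOT_' + t for t in tokens if t not in NEGATION_WORDS]
    return tokens


print(add_negation_feature(['ghori', 'bhalo', 'na']))
```

The prefixed tokens become distinct vocabulary entries, so the downstream vectorizer sees negated and non-negated uses of a word as different features.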

4 Methodology

A support vector machine (SVM) is a supervised learning technique used for both regression and classification, based on the concept of decision planes. In mathematical terms, an SVM constructs a separating hyperplane in a high-dimensional vector space. Suppose data points are viewed as \((\varvec{x}, y)\) tuples, where the \(x_j\) are the feature values and y is the class. In a multi-dimensional feature space, we can define the hyperplane as

$$\begin{aligned} \varvec{b}.\varvec{x} + b_0 = 0 \end{aligned}$$
(1)

A new observation \(\varvec{x}^*\) is then classified according to the sign of

$$\begin{aligned} f(\varvec{x}^*) = \varvec{b}.\varvec{x}^* + b_0 \end{aligned}$$
(2)

We must find \(\varvec{b}\) and \(b_0\) that yield the maximal margin hyperplane.

To maximize the margin, each data point must lie on the correct side of the hyperplane and at least a distance M from it. We relax this requirement to allow some observations to be on the incorrect side of the margin, which is called a soft margin. New parameters \({\epsilon }_i\) and C are therefore introduced to allow violations. We maximize the margin M such that:

$$\begin{aligned} \sum _{j=1}^{p} {b_j}^2 = 1 \end{aligned}$$
(3)

and

$$\begin{aligned} y_i(\varvec{b}.\varvec{x}_i+b_0) \ge M(1-{\epsilon }_i), {\forall }_i = 1, ..., n \end{aligned}$$
(4)

where,

$$\begin{aligned}\begin{gathered} {\epsilon }_i \ge 0, \sum _{i=1}^{n} {\epsilon }_i \le C \end{gathered}\end{aligned}$$

The parameter C bounds the total slack \(\sum _{i} {\epsilon }_i\), collectively controlling how much the individual \({\epsilon }_i\) can violate the margin. We used this classifier in our experiment.

We implemented our experiments using scikit-learn [11], a toolkit designed for data mining and data analysis. As raw text is not suitable input for an SVM, we extracted features from it in numerical form. We used TfidfVectorizer, which converts a collection of text documents to a matrix of token counts and then transforms it into a normalized tf-idf representation. We applied TruncatedSVD to perform linear dimensionality reduction via randomized SVD, keeping a fixed-dimensional vector space. We used LinearSVC as our classifier, which is implemented in terms of liblinear.
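A minimal version of this pipeline can be sketched with scikit-learn. The toy corpus and the component parameters below are illustrative only; the experiments use tf-idf features reduced to 2,000 SVD components on the full datasets.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# Toy labeled corpus standing in for the translated review data
docs = ['good watch', 'great product love it', 'bad watch broke',
        'terrible product hate it', 'love this great watch',
        'broke fast bad']
labels = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']

# tf-idf vectorization -> randomized SVD reduction -> linear SVM
clf = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),
                    LinearSVC(C=1.0))
clf.fit(docs, labels)

print(clf.predict(['great watch'])[0])
```

The pipeline object exposes `decision_function`, whose signed score is what we later bin into polarity-strength labels.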

5 Experimental Results

To test our system, we used tf-idf for vectorization and reduced the feature space to 2,000 dimensions via randomized SVD. This customized classifier was used in our experiment. We used 16,000 identical reviews in both Bangla and English from the watch review corpus described in Sect. 3: 13,000 reviews for training and 3,000 for testing. We obtained 86.8% accuracy for Bangla when negation is considered; ignoring negation decreases the accuracy to 85.5%. This is a clear improvement over the 85.0% obtained when Naive Bayes is applied to Bangla [5]. Table 2 shows the accuracy on our dataset for both Bangla and English. The Twitter dataset contains 600,000 tweets in total, of which 599,500 are used for training and 500 for testing. We obtain 82.0% accuracy for SVM and 78.8% for Naive Bayes when negation is ignored. When negation is considered, accuracy for SVM increases to 82.2%, but for Naive Bayes it decreases to 77.6%. For the corresponding Bangla corpus, SVM performs better without negation, giving 77.3% accuracy.

Table 2. Accuracy comparison

We label each classified polarity as weak, steady or strong, using the confidence score of the classified data point. For example, if a data point is classified as negative with a poor confidence score, it is labeled weak negative. Our experiments show that the confidence scores of all data points follow a normal distribution, so we divide this score range to label the classified polarity. Table 3 shows the ranges used to label positive and negative polarity. We divide the range \([\mu -3.0\alpha , \mu +3.0\alpha ]\) into six portions, where \(\mu \) is the mean and \(\alpha \) is the standard deviation of the scores. As the score vector forms a normal distribution, 99.7% of the values fall within this range. To verify these decisions, we collected reviews from different individuals about their own watches or other products. Each review was marked as positive or negative according to the opinion it expresses in Bangla. We then translated these reviews into English with the same meaning and sentiment and used them to test our classifier. We observe that our Bangla classifier, trained on translated data, can identify the actual polarity of the collected reviews.
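The binning of confidence scores into strength labels can be sketched as follows. This is our reading of the scheme: the standardized score's sign gives the polarity and its magnitude, clipped at \(3\alpha\), is split into three equal bands per side; the exact boundaries used in Table 3 may differ.

```python
def strength_label(score, mu, sigma):
    """Map a classifier confidence score to a polarity-strength label.

    The range [mu - 3*sigma, mu + 3*sigma] is split into six equal
    portions: strong/steady/weak negative below the mean, then
    weak/steady/strong positive above it.
    """
    z = (score - mu) / sigma          # standardized score
    polarity = 'negative' if z < 0 else 'positive'
    a = min(abs(z), 3.0)              # clip extreme outliers
    if a < 1.0:
        strength = 'weak'
    elif a < 2.0:
        strength = 'steady'
    else:
        strength = 'strong'
    return strength + ' ' + polarity


print(strength_label(0.5, 0.0, 1.0))   # weak positive
print(strength_label(-2.5, 0.0, 1.0))  # strong negative
```

In practice `score` would come from the classifier's `decision_function`, with \(\mu\) and \(\alpha\) estimated over the whole test set.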

Table 3. Confidence score range to label polarity

6 Conclusion

We developed a classifier to identify the polarity of both Bangla and English text. We proposed an SVM-based method to estimate the strength of the classified polarity. For Bangla, we generated the dataset using translation methods and removed noise from the data to make it suitable for classification. We applied the support vector machine as a supervised learning method and obtained satisfactory results. We observed that SVM-based classification for Bangla outperforms Naive Bayes as well. Reviews from individuals were also used to check the accuracy of our classifier, and the outcomes were accurate.