
1 Introduction

The opinions of others have a great impact on most of us when we make a decision. With the availability of the World Wide Web, we can now easily learn people's opinions about anything we want to know, including the opinions and experiences of people who are completely unknown to us. This paper focuses on classifying opinions from text with web-based diverse data. Performing opinion mining on Bangla data is a challenging task because the available corpora are very small. The Internet has become a convenient medium for people to express their feelings, emotions, attitudes and opinions. Automatic opinion mining is very useful for applications such as news reviews, blog reviews, stock market reviews, movie reviews, travel advice, social issue discussions, and consumer complaints. Opinion mining is also of great interest to social networking media such as Facebook, Twitter, Instagram and Google+, which can use mined opinions to detect unanticipated posts, shares and comments; this, however, involves analyzing many languages. Our work begins with resource generation, that is, building an annotated dataset from the Internet. The datasets we used are in English, collected from the web, specifically from Amazon product reviews and Twitter. We translate these to generate the Bangla corpus used in this research. We use a Support Vector Machine (SVM) for the classification task. SVM is not necessarily better than other machine learning methods, but it performs at the state-of-the-art level and has considerable current theoretical and empirical appeal. Related experimental results show that SVM tends to attain significantly higher accuracy than traditional mining schemes. We also propose a method to estimate the polarity strength of an opinion, labeling the polarity of each opinion as weak, steady or strong.
The accuracy of our classifier on Bangla data shows that our generated Bangla corpus works well for classifying an unknown Bangla opinion with promising accuracy.

2 Related Works

Opinion mining from Bangla text is still at an exploratory stage. Some research has been conducted on sentiment detection, opinion mining and polarity classification of Bangla sentences. Contextual valency analysis has been used to detect sentiment in Bangla text; this work uses SentiWordNet for the predefined polarity of each word [1]. Another work describes a sentiment analyzer that constructs phrase patterns and measures sentiment orientation using prior patterns [2]. Opinions or sentiments can be expressed as emotions or feelings, or as opinions, ideas or judgments colored by emotions [3]. Opinion mining deals with the computational treatment of opinion or sentiment in text [4]. Several supervised learning methods for classifying Bangla text based on these ideas have been proposed in the past decade. One classifier determines the opinion expressed in both English and Bangla using Naive Bayes, where the strength of each opinion polarity is estimated by probability [5]. The SVM classification algorithm outperforms the others, and good results can be achieved using unigrams as features with binary presence/absence values rather than term frequency, unlike what usually happens in topic-based categorization [6]. A hybrid system has been proposed to classify overall opinion polarity in the less-resourced Bangla language using linguistic syntactic features [7]. A hybrid approach combining Support Vector Machine and Particle Swarm Optimization has been used to mine opinions from movie reviews and was effective in increasing accuracy [8].

3 Resource Acquisition

Most opinion mining research efforts in the last decade deal with English text, with little work on Bangla text. As Bangla is a computationally under-resourced language, building a Bangla corpus is a demanding task. The construction of a large corpus for the Bangla language has historically been difficult. Although Bangla text from blogs, newspapers, Twitter, Facebook and many other online sources is available nowadays, it remains crucial to collect a usable set of text, carefully balanced in opinion. Our Bangla corpus is generated from two widely used English datasets. We used Amazon's online watch reviews [9], with 8,000 opinions for each of the positive and negative polarities; neutral reviews with rating 3 were ignored. Each review has a strong polarity and contains up to 10,000 characters. We also used a corpus of tweets already classified by sentiment, based on the Twitter Sentiment Corpus collected from Sentiment140 [10]. The Twitter Sentiment Analysis Dataset contains 600,000 classified tweets, all labeled by emoticon: for example, tweets containing ':)' and ':(' are classified as positive and negative respectively.
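The emoticon-based labeling used in the Sentiment140-style corpus can be sketched as follows. This is a minimal illustration only; the function name and the exact rules are ours, not taken from the original corpus tooling.

```python
def label_by_emoticon(tweet):
    """Assign a polarity label from emoticons (distant supervision).

    Returns 'positive', 'negative', or None when no known emoticon
    is found, in which case the tweet is left unlabeled.
    """
    if ':)' in tweet:
        return 'positive'
    if ':(' in tweet:
        return 'negative'
    return None


print(label_by_emoticon('great watch, love it :)'))  # positive
```

Tweets containing no emoticon are simply excluded from the training set under this scheme.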

We derive our Bangla corpus from these English datasets using Google Translate in two different ways [5]. In the first, a tool translates each opinion as a whole. The result may not convey the exact meaning, but it retains enough information to preserve polarity. The second method is dictionary-based translation: a dictionary is built from all the words in the English dataset, and each opinion is converted to Bangla word by word. This method also removes noise from the Bangla data.
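The dictionary-based translation can be sketched as a word-by-word lookup. The dictionary contents below are illustrative placeholders (romanized stand-ins for Bangla words); the real system builds its dictionary from the full English vocabulary of the dataset.

```python
def translate_word_by_word(sentence, dictionary):
    """Translate a sentence word by word using a fixed dictionary.

    Words missing from the dictionary are dropped, which is also how
    this method filters noise out of the generated Bangla data.
    """
    words = sentence.lower().split()
    translated = [dictionary[w] for w in words if w in dictionary]
    return ' '.join(translated)


# Toy English-to-Bangla dictionary (romanized placeholder strings)
toy_dict = {'good': 'bhalo', 'watch': 'ghori', 'very': 'khub'}
print(translate_word_by_word('Very good watch', toy_dict))  # khub bhalo ghori
```

Out-of-vocabulary tokens (URLs, usernames, misspellings) vanish from the output, so no separate cleaning pass is needed for this translation path.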

Table 1. Effects of negation (BANGLA)

Negation plays an influential role in natural language: it inverts the polarity of a sentence. In Bangla, negation is marked by specific words (e.g., ). As most of the generated sentences contain only ‘’, ‘’ and ‘’, we consider these three as negation words for Bangla data. In some cases a generated Bangla sentence contains no negation word even though it should. Since detecting the scope of negation in Bangla text is problematic, we treat every word in a negated sentence as affected and create a new feature accordingly, as shown in Table 1. The scope of negation cannot be properly modeled with this representation.
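This whole-sentence treatment of negation can be sketched as follows. Marking tokens with a `NOT_` prefix is a common feature-engineering convention we assume here for illustration, and the negation words are romanized placeholders for the three Bangla negators mentioned above.

```python
# Romanized placeholders for the three Bangla negation words
NEGATION_WORDS = {'na', 'nei', 'noy'}


def add_negation_feature(tokens):
    """Mark every token of a sentence that contains a negation word.

    The scope of negation is not modeled: if any negator appears,
    all remaining words are treated as negated, mirroring the coarse
    representation described in the text.
    """
    if any(t in NEGATION_WORDS for t in tokens):
        return ['NOT_' + t for t in tokens if t not in NEGATION_WORDS]
    return tokens


print(add_negation_feature(['ghori', 'bhalo', 'na']))
```

The prefixed tokens become distinct vocabulary entries, so the downstream vectorizer sees negated and non-negated uses of a word as different features.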

4 Methodology

A support vector machine (SVM) is a supervised learning technique used for both regression and classification, based on the concept of decision planes. In mathematical terms, an SVM constructs a separating hyperplane in a high-dimensional vector space. Suppose data points are viewed as \((\varvec{x}, y)\) tuples, where the \(x_j\) are the feature values and y is the class. In a multi-dimensional feature space, we can define the hyperplane as

$$\begin{aligned} \varvec{b}.\varvec{x} + b_0 = 0 \end{aligned}$$
(1)

A new observation \(\varvec{x}^*\) is then classified according to the sign of

$$\begin{aligned} f(\varvec{x}^*) = \varvec{b}.\varvec{x}^* + b_0 \end{aligned}$$
(2)

We must find \(\varvec{b}\) and \(b_0\) that yield the maximal margin hyperplane.

To maximize the margin, each data point must lie on the correct side of the hyperplane and at least a distance M from it. We relax this requirement to allow some observations to be on the incorrect side of the margin, which is called a soft margin. New parameters \({\epsilon }_i\) and C are therefore introduced to allow violations. We maximize the margin M such that:

$$\begin{aligned} \sum _{j=1}^{p} {b_j}^2 = 1 \end{aligned}$$
(3)

and

$$\begin{aligned} y_i(\varvec{b}.\varvec{x}_i+b_0) \ge M(1-{\epsilon }_i), {\forall }_i = 1, ..., n \end{aligned}$$
(4)

where,

$$\begin{aligned}\begin{gathered} {\epsilon }_i \ge 0, \sum _{i=1}^{n} {\epsilon }_i \le C \end{gathered}\end{aligned}$$

The parameter C bounds the total slack \(\sum _{i} {\epsilon }_i\), collectively controlling how much the individual \({\epsilon }_i\) can violate the margin. We used this classifier in our experiment.

We implemented our experiments using scikit-learn [11], a toolkit designed for data mining and data analysis. As raw text is not suitable input for an SVM, we extracted features from it in numerical form. We used TfidfVectorizer, which converts a collection of text documents to a matrix of token counts and then transforms it into a normalized tf-idf representation. We applied TruncatedSVD to perform linear dimensionality reduction via randomized SVD, keeping a fixed-dimensional vector space. We used LinearSVC as our classifier, which is implemented in terms of liblinear.
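A minimal version of this pipeline can be sketched with scikit-learn. The toy corpus and the component parameters below are illustrative only; the experiments use tf-idf features reduced to 2,000 SVD components on the full datasets.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# Toy labeled corpus standing in for the translated review data
docs = ['good watch', 'great product love it', 'bad watch broke',
        'terrible product hate it', 'love this great watch',
        'broke fast bad']
labels = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']

# tf-idf vectorization -> randomized SVD reduction -> linear SVM
clf = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),
                    LinearSVC(C=1.0))
clf.fit(docs, labels)

print(clf.predict(['great watch'])[0])
```

The pipeline object exposes `decision_function`, whose signed score is what we later bin into polarity-strength labels.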

5 Experimental Results

To test our system, we used tf-idf for vectorization and reduced the feature space to 2,000 dimensions via randomized SVD. This customized classifier was used in our experiment. We used 16,000 identical reviews in both Bangla and English from the watch review corpus described in Sect. 3: 13,000 reviews for training and 3,000 for testing. We obtained 86.8% accuracy for Bangla when negation is considered; ignoring negation decreases the accuracy to 85.5%. This is a clear improvement over the 85.0% obtained when Naive Bayes is applied to Bangla [5]. Table 2 shows the accuracy on our dataset for both Bangla and English. The Twitter dataset contains 600,000 tweets in total, of which 599,500 are used for training and 500 for testing. We obtain 82.0% accuracy for SVM and 78.8% for Naive Bayes when negation is ignored. When negation is considered, accuracy for SVM increases to 82.2%, but for Naive Bayes it decreases to 77.6%. For the corresponding Bangla corpus, SVM performs better without negation, giving 77.3% accuracy.

Table 2. Accuracy comparison

We label each classified polarity as weak, steady or strong, using the confidence score of the classified data point. For example, if a data point is classified as negative with a poor confidence score, it is labeled weak negative. Our experiments show that the confidence scores of all data points follow a normal distribution, so we divide this score range to label the classified polarity. Table 3 shows the ranges used to label positive and negative polarity. We divide the range \([\mu -3.0\alpha , \mu +3.0\alpha ]\) into six portions, where \(\mu \) is the mean and \(\alpha \) is the standard deviation of the scores. As the score vector forms a normal distribution, 99.7% of the values fall within this range. To verify these decisions, we collected reviews from different individuals about their own watches or other products. Each review was marked as positive or negative according to the opinion it expresses in Bangla. We then translated these reviews into English with the same meaning and sentiment and used them to test our classifier. We observe that our Bangla classifier, trained on translated data, can identify the actual polarity of the collected reviews.
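The binning of confidence scores into strength labels can be sketched as follows. This is our reading of the scheme: the standardized score's sign gives the polarity and its magnitude, clipped at \(3\alpha\), is split into three equal bands per side; the exact boundaries used in Table 3 may differ.

```python
def strength_label(score, mu, sigma):
    """Map a classifier confidence score to a polarity-strength label.

    The range [mu - 3*sigma, mu + 3*sigma] is split into six equal
    portions: strong/steady/weak negative below the mean, then
    weak/steady/strong positive above it.
    """
    z = (score - mu) / sigma          # standardized score
    polarity = 'negative' if z < 0 else 'positive'
    a = min(abs(z), 3.0)              # clip extreme outliers
    if a < 1.0:
        strength = 'weak'
    elif a < 2.0:
        strength = 'steady'
    else:
        strength = 'strong'
    return strength + ' ' + polarity


print(strength_label(0.5, 0.0, 1.0))   # weak positive
print(strength_label(-2.5, 0.0, 1.0))  # strong negative
```

In practice `score` would come from the classifier's `decision_function`, with \(\mu\) and \(\alpha\) estimated over the whole test set.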

Table 3. Confidence score range to label polarity

6 Conclusion

We developed a classifier to identify the polarity of both Bangla and English text. We proposed an SVM-based method to estimate the strength of the classified polarity. For Bangla, we generated the dataset using translation methods and removed noise from the data to make it suitable for classification. We applied the support vector machine as a supervised learning method and obtained satisfactory results. We observed that SVM-based classification for Bangla outperforms Naive Bayes as well. Reviews from individuals were also used to check the accuracy of our classifier, and the outcomes were accurate.