1 Introduction

Sentiment analysis, or opinion mining, is one of the most active research fields today, as knowing the sentiments and opinions of people is crucial in every industry. Millions of reviews are posted every day, and manually tracking them is infeasible, so sentiment analysis plays an important role in applications such as reviewing movies, products, etc. [6]. It can also be used to predict stock prices from users' posts on social media [2], and even presidential campaigns can use sentiment analysis to measure the popularity of candidates and predict the winners [23].

Extracting opinions from Twitter is a challenging task, as users write in many different languages. Users tend to express their opinions in informal language, and that language evolves quickly with new abbreviations and acronyms. This creates the need for a unified method that requires no manual effort to deal with such a massive amount of evolving, unstructured data. However, most prior research targeted a specific language, and sometimes a specific topic or context, and depended on manually created corpora and lexicons that require hard manual work.

Fortunately, Twitter has some interesting features that can help in building models that are language-independent and self-learning. One of those features is emoticons/emoji. Even though there is a difference between emoticons and emoji, in this paper we use the terms interchangeably, as both are used to express sentiments and they can be mapped to each other. Emoticons are popular and commonly used in text messages, and they express the same sentiments across different users, wherever they come from.

Another feature is the geo-location data associated with each tweet. This data enables us to build a model from scratch, using emoticons, that is specific to a particular region, under the assumption that people from a certain area use the same language.

In this paper, we build a semi-supervised model that uses unlabeled tweets gathered by location from the USA; since the USA is predominantly English-speaking, this lets us classify English tweets. We auto-label the tweets using emoticons to generate the training data for our models, then extract features from the labeled data with statistical and unsupervised approaches, i.e., tf-idf and word2vec. Finally, we apply different classical and deep learning algorithms and combine them to exploit their complementary strengths. The result is a unified model that can be applied to any raw tweets without prior knowledge of their language or the need for any manual tasks.

2 Previous Work

In recent decades, sentiment analysis has held the interest of many researchers. [20] provided a pioneering study of how to apply machine learning methods such as Naive Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (MaxEnt) to the text classification problem. They used bag-of-words (BOW) with unigram, bigram and part-of-speech (POS) features. Researchers subsequently attempted to improve the accuracy [1, 11, 24], but most of the work depended on a specific language or context, or required various manual tasks.

To overcome those limitations, researchers developed semi-supervised methods. [13] built a model that needed only 3 words of any language and some unlabeled data to auto-generate training labels and then classify texts as positive or negative. [7] showed that relying on emoticons as heuristic data to label raw tweets can be reliable. [17] used emoticons to auto-label tweets and applied the approach to four different languages and to multilingual tweets. Although these papers achieved good accuracies, they used classical methods, and newer methods have since emerged with better accuracies.

Other researchers worked with large neural networks. [4] built a model that handles most natural language processing tasks, such as POS tagging, chunking (CHUNK) and Named Entity Recognition (NER), almost from scratch, using a multi-layer neural network and a large amount of unlabeled data. In 2013, [16] introduced a revolutionary word embedding model called word2vec that represents the semantic meaning of words from their context in an unsupervised way.

Many researchers combined word2vec with deep learning algorithms to achieve better accuracy. [10] used a deep Convolutional Neural Network (CNN), representing each sentence as the list of its word-embedding vectors and applying filters with the width of the word2vec representation and different heights. Others combined different deep learning algorithms for better performance [18], but deep learning methods are time-consuming, hard to interpret and require a large amount of data.

3 Theory and Algorithms

The proposed model is a unified model that can be applied to any raw tweets, with no restrictions on the language used or on how the tweets were written. It combines four independent classical and deep learning algorithms through a voting ensemble. All models are semi-supervised and use emoticons as heuristic data to generate the training data. Features are then extracted either statistically or with unsupervised techniques, i.e., BOW or word2vec. Finally, different classifiers, i.e., MaxEnt, SVM, Long Short-Term Memory (LSTM) and CNN, are applied to the appropriate extracted features.

This section gives an overview of the steps that we followed to construct each of our models and the techniques that were used in each step.

3.1 Data Processing

The first step in our model is processing the data. Data processing is a crucial task in any text classification problem, as every subsequent step depends on it [14]. In contrast to typical approaches that use language-specific operations such as stop-word removal and CHUNK, we use only operations that are common to all languages or related to Twitter itself (a minimal code sketch follows the list):

  • Identifying emoticons and replacing them with their scores, which will be used in the next step to auto-label tweets.

  • Splitting hashtags into their separate words.

  • Replacing Twitter’s reserved words, such as RT for retweets and @ for mentions, with placeholders.

  • Replacing URLs with placeholders.

  • Reducing elongated words by collapsing repeated characters (allowing at most two duplicates), as users tend to repeat characters to emphasize meaning.
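A minimal Python sketch of these operations, assuming simple regular expressions; the emoticon-score table is a hypothetical subset of the scores from [19], and hashtag splitting is reduced to stripping the '#' for brevity:

```python
import re

# Hypothetical subset of the emoticon/emoji score table from [19].
EMOTICON_SCORES = {":)": 0.5, ":D": 0.9, ":(": -0.7}

def preprocess(tweet):
    # Mark emoticons with their scores for the auto-labeling step.
    for emo, score in EMOTICON_SCORES.items():
        tweet = tweet.replace(emo, f" __EMO({score})__ ")
    # Strip '#' so the hashtag text joins the surrounding words.
    tweet = re.sub(r"#(\w+)", r"\1", tweet)
    # Replace Twitter-reserved words and URLs with placeholders.
    tweet = re.sub(r"\bRT\b", "__RT__", tweet)
    tweet = re.sub(r"@\w+", "__MENTION__", tweet)
    tweet = re.sub(r"https?://\S+", "__URL__", tweet)
    # Collapse character runs ("sooooo" -> "soo"), allowing only duplicates.
    return re.sub(r"(.)\1{2,}", r"\1\1", tweet)

print(preprocess("RT @bob sooooo good :) #HappyDay https://t.co/x"))
```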

3.2 Auto-Labeling

Auto-labeling is the second task in our model. In this step, we generate our training data from unlabeled tweets. We rely on sentiment carriers, i.e., emoticons and emoji, to label the raw data, scoring each tweet based on the scores of the sentiment carriers it contains. This task is similar to the work of [7, 17], but instead of simply dividing carriers into positive and negative ones, we use the emoji scores provided by [19]. Table 1 shows a sample of emoji and their equivalent emoticons with their scores. Each tweet is scored by the average score of all its sentiment carriers; e.g., if a tweet has 2 emoticons with scores .9 and .5, the score of the tweet is \(\frac{.5\,+\,.9}{2} = .7\). A tweet is labeled positive or negative only if it carries a single sentiment or its \(|score| > .7\); otherwise it is neglected (a sketch of this rule is shown below). By applying this approach to the CIKM dataset, we generated 170K labeled tweets that were used as our training data.

Table 1. Sample of emoji and the equivalent emoticons with their scores
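A minimal sketch of the labeling rule, assuming the carriers have already been identified during preprocessing; the score table is a hypothetical subset of [19]:

```python
# Hypothetical subset of the emoji/emoticon scores from [19].
CARRIER_SCORES = {":)": 0.5, ":D": 0.9, ":(": -0.7, ":'(": -0.9}

def auto_label(tweet, threshold=0.7):
    """Return 'positive', 'negative', or None if the tweet is neglected."""
    scores = [s for emo, s in CARRIER_SCORES.items() if emo in tweet]
    if not scores:
        return None                          # no sentiment carriers at all
    avg = sum(scores) / len(scores)          # e.g. (.5 + .9) / 2 = .7
    single = all(s > 0 for s in scores) or all(s < 0 for s in scores)
    if single or abs(avg) > threshold:
        return "positive" if avg > 0 else "negative"
    return None                              # mixed, low-confidence tweet

print(auto_label("we won :) :D"))            # -> 'positive'
```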

3.3 Feature Extraction

The next step is extracting the important features from the available data. As we build multiple models, we use different feature representation techniques; the following subsections give an overview of each.

TF-idf. Tf-idf is a statistical approach to finding the important words in a corpus. It weighs the frequency of a word in a document against its frequency across the whole corpus. Tf-idf is calculated as follows:

$$\begin{aligned} tf\textit{-}idf(t,d,D) &= tf(t,d) \times idf(t)\\ idf(t) &= \log {\frac{n_d}{1+df(d,t)}} \end{aligned}$$

where t is the word, d is the tweet, tf(t, d) is the term frequency in the document, \(n_d\) is the total number of tweets and df(d, t) is the number of documents in which the word occurs.
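For illustration, a toy tf-idf computation with scikit-learn; sklearn's smoothing differs slightly from the formula above, but the idea is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["good movie", "bad movie", "good good day"]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)           # sparse (3 docs x vocabulary) matrix
print(vec.get_feature_names_out())      # ['bad' 'day' 'good' 'movie']
print(X.toarray().round(2))             # tf-idf weight of each word per tweet
```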

BOW. Bag-of-words is one of the most widely used feature representation techniques in sentiment analysis [14]. The basic idea is to select a set of important words and represent each document as a vector of the occurrence counts of those words. BOW does not consider the order of words, so it is often combined with an n-gram model. The important words can be selected using language-dependent features, i.e., POS and semantic words, or using a statistical approach, i.e., tf-idf, as in the proposed model.
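A minimal BOW sketch; the 8K vocabulary cap matches the setting reported in Sect. 5.2, and ranking by raw frequency here stands in for the tf-idf selection used in the proposed model:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigram+bigram counts; max_features keeps the 8,000 most frequent terms.
bow = CountVectorizer(ngram_range=(1, 2), max_features=8000)
X = bow.fit_transform(["good movie", "not a good movie"])
print(bow.get_feature_names_out())      # selected words and bigrams
print(X.toarray())                      # occurrence counts per document
```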

Word2Vec. Word2vec is a word embedding technique developed by [16]. It builds on the idea that the meaning of a word is defined by the company it keeps, i.e., its context [8]. Word2vec trains a 2-layer neural network to represent each word as a fixed-length vector based on its context. The resulting vectors have some unique properties and can solve word-analogy problems through arithmetic operations, such as \( king - man = queen - woman\).
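A minimal training sketch with gensim, where the toy token lists stand in for the preprocessed tweet corpus; the 200-dimensional embedding matches Sect. 5.2:

```python
from gensim.models import Word2Vec

tokenized_tweets = [["good", "movie"], ["bad", "movie"], ["good", "day"]]
w2v = Word2Vec(tokenized_tweets, vector_size=200, window=5, min_count=1)
print(w2v.wv["good"].shape)             # a 200-d vector for one word
print(w2v.wv.most_similar("good", topn=2))
```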

3.4 Classifiers

The final step is classification. In our model, we use different classifiers and then combine them with an ensemble classifier to exploit their unique strengths. This subsection describes the classifiers we used.

SVM. Support Vector Machines have shown reliable results on the sentiment classification problem. SVM was one of the first methods applied in this field and is still widely used. We found that this model performs very well with BOW and with word2vec summation.

MaxEnt. Maximum Entropy is one of the most used models in a wide range of applications. MaxEnt uses the following equation to find the probability of the output given a certain input:

$$\begin{aligned} P_{ME}(c|d,\lambda ) = \frac{\exp \left[ \sum _i{\lambda _i f_i(c,d)} \right] }{\sum _{c'}\exp \left[ \sum _i{\lambda _i f_i(c',d)} \right] } \end{aligned}$$

In our model, c is the class (positive or negative), d is the tweet and \(\lambda \) denotes the learned parameters of the model. We found that this model also performs very well with BOW and with word2vec summation.
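A toy scikit-learn sketch of both classical classifiers, treating MaxEnt as logistic regression (its standard implementation); the feature matrix is a stand-in for BOW vectors or summed word2vec vectors, and C is the regularization parameter tuned in Sect. 5.2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]])
y = np.array([1, 1, 0, 0])              # 1 = positive, 0 = negative
maxent = LogisticRegression(C=1.0).fit(X, y)
svm = LinearSVC(C=1.0).fit(X, y)
print(maxent.predict_proba(X[:1]))      # class probabilities (used by Model 5)
print(svm.decision_function(X[:1]))     # signed distance to the hyperplane
```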

LSTM. The Recurrent Neural Network (RNN), introduced in the early 1980s, was designed to capture patterns in sequential data. An RNN typically consists of multiple connected units of the same single-layer neural network. LSTM [9] is an enhancement of the RNN that overcomes its lack of selectivity and its vanishing/exploding gradient problems. LSTM extends the RNN architecture by replacing the connected single-layer units with a more complex unit consisting of a memory cell and three gates that control the flow of the state.
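One possible Keras sketch of such a classifier, assuming tweets padded to 40 tokens and the 8K vocabulary and 200-d embeddings of Sect. 5.2; the hidden dimension and dropout are the parameters tuned in Sect. 5.2:

```python
from tensorflow.keras import layers, models

lstm = models.Sequential([
    layers.Input(shape=(40,)),                         # tweets padded to 40 tokens
    layers.Embedding(input_dim=8000, output_dim=200),  # word-embedding lookup
    layers.LSTM(128, dropout=0.5),                     # tuned hidden dim / dropout
    layers.Dense(1, activation="sigmoid"),             # P(positive)
])
lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```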

CNN. The Convolutional Neural Network has attracted many researchers in recent years due to its strong performance, especially in image recognition and text classification [15]. The idea of a CNN is to build a convolutional layer from a set of filters that traverse the input layer; this convolutional layer is followed by a pooling layer and a deep neural network. CNN has three variations based on how the model updates the word-embedding values during training: CNN-static, CNN-non-static and CNN-rand. In recent years, CNN has been applied to text classification and achieved remarkable results.
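A sketch in the spirit of [10] under the same assumptions as the LSTM sketch: parallel convolutions whose kernel heights act as context windows, each filter spanning the full embedding width:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(40,))
emb = layers.Embedding(8000, 200)(inp)           # trainable, as in CNN-non-static
pooled = [layers.GlobalMaxPooling1D()(layers.Conv1D(100, h, activation="relu")(emb))
          for h in (3, 4, 5)]                    # one branch per filter height
merged = layers.Dropout(0.5)(layers.Concatenate()(pooled))
out = layers.Dense(1, activation="sigmoid")(merged)
cnn = models.Model(inp, out)
cnn.compile(optimizer="adam", loss="binary_crossentropy")
```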

Voting Ensembles. Ensemble classifiers have proved their reliability in research and applications [21]. A voting ensemble combines different classifiers by treating the output of each classifier as a vote and taking its final decision based on those votes. Voting ensembles have different implementations, i.e., majority voting and weighted voting. A majority voting ensemble takes the decision of the majority of the base classifiers. A weighted voting ensemble is an enhancement of majority voting: instead of treating the classifiers equally, each classifier has its own weight when it votes. Voting ensembles are often used when the base classifiers differ in architecture and nature. In our model, we combine the output of our base classifiers using majority and weighted voting, which showed a remarkable improvement in the final accuracy.
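A minimal sketch of both schemes over +1/-1 votes from three base classifiers; the weights are hypothetical and would be tuned as in Sect. 5.2:

```python
import numpy as np

votes = np.array([[ 1, -1,  1],          # model 1's vote on each tweet
                  [ 1,  1, -1],          # model 2
                  [-1,  1,  1]])         # model 3

majority = np.where(votes.sum(axis=0) >= 0, 1, -1)
weighted = np.where(np.array([2, 1, 1]) @ votes >= 0, 1, -1)  # model 1 counts double
print(majority, weighted)
```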

4 The Proposed Models

The previous section covered the general steps and techniques used to build the models. This section shows how each model was built in detail, the motivation behind each model and, finally, how and why combining the different models generates better results.

4.1 Model 1: BOW with tf-idf

Model. In this model, we use tf-idf to construct our bag-of-words, then apply different classical models, i.e., MaxEnt and SVM. Several parameters affected this model, such as the loss function and regularization parameter of the classifier, the size of the BOW and the n-gram size.

Motivation. Bag-of-words is one of the oldest techniques, but it continues to prove reliable. BOW depends heavily on the selected n-gram features, so it has limitations: it does not consider the order of words and it is limited to a short context, namely the n-gram size. On the other hand, BOW thrives when tweets are short and direct, or when they contain powerful words such as “won” and “good”.

4.2 Model 2: Aggregation of Word2Vec

Model. This model is similar to the work of [12]. Since each word is represented as a vector, applying arithmetic operations on those vectors, i.e., “mean” and “summation”, approximates the semantic meaning of the whole sentence. We then apply different classical classifiers, such as MaxEnt and SVM, to the output of the arithmetic operations.
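A minimal sketch of the aggregation step, assuming `w2v` is a trained gensim word2vec model as in Sect. 3.3; swapping `np.mean` for `np.sum` gives the summation variant:

```python
import numpy as np
from gensim.models import Word2Vec

def tweet_vector(tokens, w2v, agg=np.mean):
    """Aggregate a tweet's word vectors into a single feature vector."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return agg(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

w2v = Word2Vec([["good", "movie"], ["bad", "movie"]], vector_size=200, min_count=1)
print(tweet_vector(["good", "movie"], w2v).shape)   # (200,)
```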

Motivation. Word2vec represents words as vectors that capture their semantic meaning. The original paper [16] showed that arithmetic operations can solve word-analogy problems, e.g., \( king - man + woman \) yields queen, and that words with similar meanings are close to each other, so applying arithmetic operations is effective. On the other hand, aggregating word2vec vectors has a drawback: it does not consider word order, which limits its classification ability.

4.3 Model 3: LSTM

Model. In this model, we represent each document as the list of the word-embedding vectors of its words, feed this representation to the sequential units of the LSTM model and, finally, tune the LSTM model.

Motivation. LSTM is very powerful with sequential data. It can handle word order and long context thanks to its sequential architecture and memory units. On the other hand, LSTM is a deep learning algorithm, so it needs a massive amount of training data and considerable training time.

4.4 Model 4: CNN

Model. We build a model similar to [10]: each document is represented as the list of the word-embedding vectors of its words, then we apply a CNN with different filters that have the same width as the word embedding and different heights. In this model, we use word2vec as our word-embedding representation.

Motivation. CNNs have proved their validity even when different datasets are used for training and testing [22]. A CNN lets us define the window size of the filters, which represents the context in our model. The drawbacks of CNN are the same as those of LSTM, as both are deep learning algorithms.

4.5 Model 5: The Proposed Model

Motivation. We have built four different models, each with its own advantages and unique features that can compensate for the weaknesses of the others. Figure 1 shows the confusion matrix of the number of tweets correctly classified in common by the different models on the test dataset; the dark and light colors indicate the relation between the models. The figure shows that each model outperforms the others on some tweets and that the models are independent of each other. Besides the direct output of the models, some other features give more insight. Figure 2 shows the relation between the classification probability of each model and its actual accuracy at that probability: for all models, the classification probability has a positive linear relationship with the accuracy, so it can be treated as a confidence of prediction. These observations suggest that a classifier combining the different models and exploiting these features can improve the final accuracy.

Fig. 1. Number of common correctly classified tweets between models

Fig. 2. Classification probability vs. accuracy

Model. To make use of the different models, a weighted voting ensemble classifier is used. The classifier combines the outputs of the different models with their classification probabilities, and each model is assigned a different weight when voting.
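A sketch of this final step, assuming each base model outputs P(positive), which Fig. 2 justifies treating as a confidence; the weights are hypothetical:

```python
import numpy as np

def ensemble_predict(probs, weights):
    """probs: (n_models, n_tweets) of P(positive); weights tuned in [1, 5]."""
    confidence = 2 * probs - 1              # map [0, 1] to a signed vote in [-1, 1]
    return np.where(weights @ confidence >= 0, "positive", "negative")

probs = np.array([[0.9, 0.2], [0.6, 0.4], [0.3, 0.8]])
print(ensemble_predict(probs, weights=np.array([2, 1, 1])))
```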

5 Experiments

This section describes the datasets that were used, how the experiments were set up and, finally, the results achieved by the different models.

5.1 The Datasets

CIKM Dataset. This dataset was provided by [3]. It consists of 8M raw tweets with associated geo-location data. The tweets were collected by location from different places in the US with no restriction on the language used, so it contains some tweets in French, Spanish, etc. We used this dataset to generate our training data by applying our auto-labeling technique.

STS-Test Dataset. The STS-test set is one of the most popular datasets used in sentiment analysis. It was provided by [7] and contains 498 manually annotated tweets: 177 negative, 139 neutral and 182 positive. STS-test is part of a bigger corpus containing an additional 1.5M auto-labeled tweets gathered in 2009. The corpus was limited to the English language and to certain categories, e.g., products, companies, events, etc.

5.2 The Experimental Protocol

We use the CIKM dataset to generate our training data with the auto-labeling technique described in Sect. 3.2, and we use the STS-test set to test our models. During training and testing, we treat each tweet as an independent document that expresses either positive or negative sentiment [14].

In the feature extraction step, we found that BOW achieved better results with a vocabulary size of 8K and that a word2vec embedding size of 200 was sufficient, so all our experiments used these values.

During the classification step, SVM and MaxEnt were tuned for the best regularization value. CNN was tuned over the hidden dimension, dropout probability and number of filters; LSTM was tuned over the hidden dimension and dropout probability. The weighted voting ensembles were tuned over per-model weights in the range 1 to 5.

Table 2. Summary of the accuracy of all models
Table 3. Proposed model accuracy
Table 4. Comparing the proposed model with the previous models

5.3 The Experimental Results

Table 2 summarizes the results achieved by each of the four models. Table 3 compares the results of the four individual models with those of our proposed model. Notice that the final model, which uses the classification probabilities, outperformed all the individual models: a simple voting ensemble improved the accuracy of the individual models by more than \(1\%\). Table 4 lists the previous work on the same STS-test set. Comparing our results with previous work, we achieved near state-of-the-art accuracy with only 170K labeled tweets, i.e., only 10% of the others’ training data, which saves much time and many resources.

6 Conclusions and Future Work

We built a unified model that can be applied without any manual tasks and does not require any information about the language used. We used emoticons as heuristic data to auto-generate our training data. We then built multiple classical and deep learning models and combined them to achieve better accuracy. We achieved accuracy near the state-of-the-art results using only 170K training tweets, i.e., only 10% of the 1.5M tweets used by other models.

Combining classical and deep learning algorithms improved the accuracy of both. Deep learning algorithms can infer long and complex sentences correctly and thrive when a large amount of data is available, but training them consumes time and resources. Classical algorithms, on the other hand, are simple and can be trained quickly and easily, and they can even outperform deep models in some situations. Our model combines the different models in parallel through voting ensembles, so each model remains independent of the others and more models can be added easily to achieve better accuracy. In the future, we will work on improving the ensemble classifier by adding more base models and by finding more parameters that could improve the overall accuracy.