Keywords

1 Introduction

Since the Web 2.0 advent, it is increasingly common to find online valuable opinions or reviews related to products, services, organizations, events and various other items. This is due to the increasing use of social networks, blogs and especially tools that allow users to register their opinions on products or services in e-commerce websites [1]. Capturing and processing such information properly, and discovering the interest of the general public on any item, is of great interest to the business world or to the well-being of a community. The sentiment analysis community is thus developing tools that aim to assist in the recovery and mining of such data [2]. To assist users and online companies, sentiment analysis can be useful in recommender systems to identify if a user’s review on a product or service is positive (thumbs up) or negative (thumbs down) [3]. Web users share opinions about the products or services they consume. Therefore, aggregated information, such as opinions of web users, can be used in the decision-making process of those that prospect such items. Thus, sentiment analysis can be performed using machine learning algorithms, lexical analysis, keywords or ontologies. While many works propose the classification of the polarity of a review, i.e., positive or negative (binary classification), some researches focus on the inference of ratings (multiclass classification) for the opinions [4]. In this case, the main goal is to classify each opinion in respect to a grade range – typically 1 to 5 stars – and not just as positive or negative.

There are two approaches for multiclass classifier construction. The first one is using learning algorithms constructed for multiclass problems, such as Naïve Bayes, induction of Decision Trees, generalization of Neural Networks, and so on. The second one is based on decomposing the initial multiclass problem into a combination of binary problems. There are two classical techniques for transforming the problem: (i) one-versus-one, where a binary classifier is constructed to distinguish between a pair of classes, and so the number of binary classifiers that compose the multiclass classifier is given by the combination of all classes; and (ii) one-vs-all, where a binary classifier is constructed to distinguish a class against all the other ones. These approaches are commonly used when SVM algorithms are indicated to the problem. A newer approach for decomposing the problem is the one implemented by the Nested Dichotomies algorithms [7], which constructs trees of (all the) possible combinations of class binary splits. This algorithm was not used before in sentiment analysis domain.

In this work, we propose a model based on an adaptation of the Nested Dichotomies algorithm, aiming at reducing the number of trees constructed by the algorithm using domain characteristics, i.e., the first tree division is determined by applying and analyzing questionnaires answered by users of the target domain, in order to identify the users’ preference. This paper discusses a methodology to create a model for the classification of user reviews in 5 classes based on binary splits tailored for a specific domain. We present a case study in which the classification model is applied to a set of subjective data extracted from TripAdvisor website (http://www.tripadvisor.com), containing revisions labeled with ratings from 1 to 5. We eventually discuss the process of determining the first division and evaluate the accuracy of the obtained model.

In the next section, we describe other works that focus on sentiment analysis using binary or multiclass classification. In Sect. 3, we discuss the Nested Dichotomies algorithm, in which base our proposal. In Sect. 4, we present our proposal for a domain-tailored multiclass classifier. Finally, in Sect. 5, we draw our conclusions.

2 Related Work

The main research in sentiment analysis aims to measure the positivity of an opinion or review, i.e., to classify it as recommended or not recommended (binary classification). Other works focus on summarization, classification of opinions as either objective or subjective, and inference rating (multiclass classification). Our focus in this work is on ratings classification based on user’s opinion. So, this section presents relevant related work on opinion mining using learning algorithm for (i) classifying opinions on positive or negative, i.e., binary classification; or (ii) classifying opinions on three or more classes, i.e., multiclass classification. In some studies, the authors create a scale that varies with the sentiment intensity, for example, five classes: excellent, good, fair, poor and very poor. In other studies, researchers classify opinions based on ratings, where each rate is represented by a number of stars and each star represents a class. This problem is known as inference rating, which is the focus of this work. Further, some works use lexicon models to classify an opinion, or to enrich the (machine) learning process of a classifier. A lexicon model presents norms and conventions of a language, where each word or expression is associated with a probable sentiment.

2.1 Basic Concepts

According to [8], the sentiment analysis or opinions mining is the field of study that examines the attitudes, emotions, feelings and opinions of people related to entities such as products, services, organizations, events, topics and attributes of these entities. Sentiment analysis is a challenging subarea of Natural Language Processing and can approach many problems, from classification in opinion polarity to the summarization process of general feelings about something.

Considering a text t, the initial task in opinions mining consists in deciding whether t is objective or expresses a feeling. In the latter case, t is subjective and express an opinion O, formally represented as a 5-tuple O = (e, a, s, h, t). Where e is the name of the entity or object to which an opinion relates; a is the specific attribute of that entity; s is sentiment of the author in relation to an attribute or entity; u is the author (user) of the opinion; and t is the date (and time, if available) on which the opinion was created. Table 1 shows an opinion O about a hotel e available at the TripAdvisor website. In this example, one can note that the hotel is the entity e, and one of the attributes are the rooms, indicated by a in Table 1. They are classified by the author as “impressive” and “spacious”, which denote sentiments (subjective evaluations) of the author about the attribute a (the rooms), indicated by s in Table 1.

Table 1. Example of a user comment about a hotel e

The main goal in sentiment analysis is to capture and process opinions generating a list of entities and their ratings, in order to assist a community or a company in better understanding their customer’s needs [9]. To [10], the ideal task in sentiment analysis should be processing a set of opinions about a certain entity, generating a list of attributes and aggregating the sentiments related to the attributes.

2.2 Binary Classification

Many of the work on sentiment analysis has focused on binary classification. A summary of the main works in binary classification can be found in [11], where the main features extraction techniques and learning algorithms used in sentiment analysis are highlighted. Among the major works, we can highlight [12], where the authors evaluate the performance of learning algorithms to determine if the sentiment of an opinion is either positive or negative. Using the comments of a movies database, they show that the algorithms are better than humans classification. Even exploring different pre-processing techniques, the assessed learning algorithms did not reach 90 % accuracy.

In [10], the authors created a method for distinguishing positive and negative reviews on two sites: Amazon.com and C|net. They use different feature selection techniques and evaluate each one, and also test various ways to improve the performance of the lexical analysis method developed by them. By the end, they produced a list for each product, considering the main features extracted from revisions and major revisions for each attribute. Among the challenges they found, they cite inconsistent rating, ambivalence and comparison and small reviews.

Analyzing data from Twitter, in [13] the authors created three databases based on the microblogging messages: positive sentiment, negative sentiment or objective text. These tweets were searched and selected based on emoticons defined as “happy” or “sad”. They compared different pre-processing techniques and used Naive Bayes algorithm. As final result, they obtained 60 to 80 % accuracy.

In [14], the authors proposed a new lexicon model for sentiment analysis of restaurants’ reviews and showed an improvement on Naive Bayes algorithm. They considered negative words and intensity adverbs in the pre-processing phase. They used SVM algorithm, Naive Bayes and an improved Naive Bayes version, proposed by them. They demonstrate that the proposed Naive Bayes technique, when configured with bigrams + unigrams, shows the best result, reaching an 81.2 % accuracy.

2.3 Rating-Inference Problem or Multiclass Classification

In [15], the authors evaluate the accuracy of humans over the task of determining the rating of a comment. They applied an algorithm based on metric labeling that, in some cases, can outperform some versions of SVM and the human baseline in sentiment classification of data, with three or four classes. As a final result, they obtained a 54.6 % accuracy.

In [16], the authors introduce a new type of data pre-processing, where each review is marked with a score, and the rating is inferred based on the set of scores. The authors claim that the model proposed by them exceeds other works in literature by a significant margin.

In [17], the authors proposed a new search in selection of opinions, in order to estimate the ratings for hotels on sites, where users write comments for a particular service. The rating inference was made based on one or more attributes of the hotel. They used Bayesian Networks, where the calculated rating is estimated for each attribute. Finally, the various attributes form the final rating of a hotel. This method produces the best results for the multiclass classification, with a 73.1 % accuracy.

In [18], the authors built a vector with the intensity of features to represent an opinion, and use this vector as input to the learning algorithms. To create this vector, the proposed system identifies the main characteristics that are relevant to a consumer in the product. From these characteristics, the system quantifies each opinion on hotels, creating an array of opinions. They demonstrate that different characteristics of the product have a different impact on the user’s opinion about a hotel. Finally, the system estimates the weight of each product from the intensity vector and estimates the final rating, with a 46.9 % accuracy.

Although focusing on multiclass classification as well, in [6] the authors explored other two types of affective dimensions to classify opinions: the valence and the arousal. They build the feature vectors through tokens extracted considering these two affective dimensions and used a regression model and a SVM algorithm variation to classify a belief in a sense of scale (1–5 scale), obtaining 51.8 % accuracy.

2.4 Sentiment Analysis Applications

In examples ahead, we highlight works focused on different areas ranging from education to social network services. In [19], the authors created a visual analysis of positive and negative reviews of “The Da Vinci Code”. They used a visual tool, TermWatch, to construct a multilayer network of terms based on syntactic, semantic and statistical associations. In order to evaluate the terms that have been previously selected, they use a predictive model based on SVM. They used a set of positive and negative reviews as training features. In this case, a review is decomposed into three components which reflect the presence of positive terms, negative and common in both categories.

In [20], the authors created a web recommendation system using text categorization synopsis of movies stored on IMDB, selected from EachMovie database. They used three algorithms to build a classifier: kNN, Decisions Trees and Naive Bayes. The final performance of the algorithms is around 60–65 % accuracy, with the decision trees showing best results. However, the difference between the three is negligible.

In [21], the authors used emoticons in order to train learning algorithms with opinions taken from Twitter. In addition to emoticons, keywords from Twittratr site that have positive or negative sentiments were used in training. After testing, they reach about 83 % accuracy with Naive Bayes algorithm. Finally, the authors also provide a website, where the users can know the sentiments about something over existing tweets. The site creates a list of positive, negative and neutral tweets, as well as graphics that show which sentiment is prevalent.

In a work in the education area [22], the authors built a model to evaluate Facebook posts and, by detecting the usual user’s humor, check his emotional changes. The application is called SentBuk. This information is used in e-learning systems in order to recommend the most appropriate activities in relation to the student’s humor in a given period. They built a lexicon classifier and when a large number of sentences is classified, they use these messages as training input to the machine learning algorithm. The best result was obtained using the SVM algorithm with 83 % accuracy.

3 Nested Dichotomies Algorithms

The analysis of related work led us to conclude that the best results are obtained when considering the binary opinion classification, and rating classifiers are more difficult to learn. When rating a product or a service with one to five stars, for instance, sometimes it is difficult to the user to decide between the neighbor numbers. Considering the training instances are in \( {\mathbb{R}}_{M} \), i.e., there are \( M \) features describing the data, the difficult to label the training instances (reviews) into classes (ratings) may lead to diffuse margins between the areas of each pair of classes (ratings) in \( {\mathbb{R}}_{M} \), which explains the difficulty to learn the classifiers. Thus, exploring techniques to divide the multiclass problem into binary sub-problems may lead to better results.

In [7], the authors proposed dividing a multiclass classification problem into several binary divisions. This method, called Nested Dichotomies (ND), proposes an alternative based on binary classifiers to the One-vs-All (OxA) and the One-vs-One (OxO) traditional approaches. This model consists in constructing trees recursively, where the root node represents a set of classes A, which is divided into two nodes with new mutually exclusive groups of classes B and C. Considering that the multiclass problem has n classes, in the first iteration, the root contains all the n classes, and the two leaves correspond to two disjoint sets of classes, generated by a random division of A. Thus, n-1 divisions are necessary, creating a binary tree with n leaves. A classifier is constructed for each node of the tree that is not a leave.

For example, a set of five classes needs four divisions and, therefore, four classifiers. This number is smaller than the number of classifiers used by the other two traditional approaches – OxA and OxO. Moreover, the accuracy of ND algorithm, in most cases, exceeds OxA and OxO.

Given a set of classes, it is possible to construct many different trees [23]. So, the original ND algorithm creates a list of k random trees to find an optimal tree model. This number can be informed before the execution ND, and the default is 20 trees in order to find a best result [7].

The final accuracy of the ND classifier is given by the product operator in Eq. 1 [7], where C i1 and C i2 are two subsets of a division of a set of classes Ci in an internal node i, and p(c ∈ C i1 │x, c ∈ C i ) and p(c ∈ C i2 │x, c ∈ C i ) are the conditional probability estimated distributions by the two-class model at node i for an instance x.

$$ {p\left( {c = C |x} \right) \mathop \prod \limits_{i = 1}^{n - 1 } \left( {I\left( {c \in C_{i1} } \right)p\left( {c \in C_{i1} |x, c \in C_{i} } \right) + I\left( {c \in C_{i2} } \right)p\left( {c \in C_{i2} |x, c \in C_{i} } \right)} \right)} $$
(1)

4 Domain-Tailored Multiclass Classification Model

Considering that (i) the results obtained by the binary classification techniques are naturally better than any multiclass classification in sentiment analysis; (ii) to the extent of our knowledge, ND algorithm has not been used for sentiment analysis, specially ratings classification, and (iii) the domain characteristics should be used in the first division, we propose an adaptation of the ND algorithm for rating classification in sentiment analysis to use binary algorithms in sequence to obtain the multiclass classification model.

In this model, the identification of the first two subsets of classes should be determined so that it is more meaningful for the set of users to whom the classifier applies. As such, it is necessary to understand the domain and how the rates will be useful for the users. In the next subsections, we explain our approach to understand a specific sentiment analysis domain and define the classification model by describing our case study on rating hotel reviews.

4.1 Hotel Domain

The hotel domain is frequently used in works that study the rating-inference problem. In this work, we selected a well-known dataset to analyze diverse reviews associated with a rating scale, ranging from 1 to 5. Part of the TripAdvisor reviews reference dataset was used, containing reviews in English. This dataset was used in [24] to find out the latency of the opinion of each individual, and the importance of each aspect in the final rating.

Other works, such as [18], used Booking.com reviews to propose a method to infer ratings. As stated before, in this work the ND algorithm was adapted to be applied by the first time on sentiment analysis problem. This will allow to evaluate the performance of this algorithm, which, as shown by [7], outperformed other methods for multiclass domains in other domains.

4.2 Assessing the Influence of Ratings in Hotel Booking

In order to study the domain and assess how the ratings influence the user when he is selecting a hotel, we conducted a survey to verify the user preferences. In all, 131 users answered the survey, and the final result was in line with research found in the PracticalECommerce website [25].

First of all, we draw the users general profile. Most of the users who responded to the survey were very young, as 52.3 % of them were between 20 and 29 years old, 31.8 % between 30 and 39 years old and 11.2 % between 40 and 49 years old. Only 2.8 % of the respondents were over 50 years old. and 1.9 % were below 19 years old. In addition, they had high education levels, as 42.0 % of the respondents are graduate, 23.4 % had completed a M.Sc. course and 23.4 % had completed a PhD course. Only 12.1 % had completed only high school. Figure 1 shows two graphs the summarize the profile of the respondents.

Fig. 1.
figure 1

The profile of the respondents. In (a) we see the age categorization and in (b) we see the education level categorization.

To understand the relation of the users with online hotel booking services, we first asked in how many sites on average they search a hotel before making a reservation. All the respondents are regular users of hotels booking online services. Before booking a hotel room, 38.3 % of them search options in one or two different sites, 45.9 % search options in three or four different sites and 15.8 % search in five or more than sites. We then asked them how often they check the ratings of the hotels they might want to book when they are selecting a hotel on an online booking service. While 68.4 % users answered they always check, 23.3 % said they check frequently, 4.5 % said they sometimes, 3 % said they check rarely and only 0.8 % said they never check. Figure 2 shows two graphs the summarize these aspects.

Fig. 2.
figure 2

In (a) we see in how many sites on average the users search a hotel before making a reservation. In (b), we see how often the users check the ratings of the hotels they want to book.

To assess how important ratings are for the users in the booking process, initially we asked them if they choose a hotel primarily based on its rating or if they consider other aspects in the first place. We found out that 65.4 % of the users base their decision primarily on the ratings, while only 34.6 % of them would firstly take other aspects in consideration. For those users that base their decision on the ratings, we asked if they would rather select only 5-stars hotels, 4 or 5-stars hotels or 3 to 5-stars hotels (we despised other possibilities in order to make the survey simpler and target our goal). We found out that only 14.9 % of the users would prefer 5-stars hotels, while 46.0 % would prefer a 4 or 5-stars hotels and 39.1 % would choose hotels that are 3-stars or more. Figure 3 shows two graphs the summarize this data.

Fig. 3.
figure 3

In (a), we see the percentage of users that choose a hotel primarily based on its rating and those that consider other aspects in the first place. In (b), we the percentage of users that book preferably 5-stars hotels, 4 or 5-stars hotels or 3 to 5-stars hotels.

As a result, we conclude that the two class groups that are most significant for the users in this domain are 1 to 3-stars hotel and 4 or 5-stars hotels, as 46.0 % of the users that base their booking decisions on ratings regard these groups as non-recommendable and recommendable, respectively.

4.3 The Proposed Model

We propose an architecture for a multiclass classifier for solving the rating-inference problem, based on the Nested Dichotomies algorithm. With this architecture we build a single classifier as a sequence of binary classifiers, as shown in Fig. 4.

Fig. 4.
figure 4

Binary split model for the inference problem ratings in the hotel booking domain

At first, reviews are divided into two main class groups: “recommended” and “not recommended”. From this classification, the class group labeled as “not recommended” is split into two categories: “very bad” and “not so bad”. “Not so bad” class group goes through a new process of binary split generating “bad” and “regular” rating groups. Regarding the class group classified as “recommended”, this will be split in “good” and “very good”. Thus, the final five classes for the views are identified, “very bad”, “bad”, “regular”, “good”, “very good”, each of which corresponds to star rating on the Likert scale (1–5).

This architecture has been tested with several different base-classifiers, and the best results were obtained with the Naive Bayes algorithm, with a 56.6 % accuracy. Although the final accuracy is less than 60 %, the close accuracy, which is calculated considering correct when an instance is classified as a class neighbor to the correct one [26], exceeds 90 %, as in the use of individual algorithms. Besides, the accuracy of the first classification, which divides the instances in the two groups that are represent the users’ preference – not recommended (1, 2 and 3-stars) and recommended (4 and 5-stars) – is 88.9 %. These results overcome many of the earlier works in sentiment analysis, considering the accuracy of the models to binary and multiclass classification. It is worth mentioning also the high accuracy obtained in the classification of the “extreme classes”, i.e. 1 and 5-stars, which exceeds 75 %. This can be explained by the difficulty in analyzing intermediate classes, due to the similarity between them, while the extreme classes can be more easily discriminated.

5 Conclusion

In this article, we propose a new method for rating-inference problem based on an adaptation of the Nested Dichotomies algorithm for dividing the multiclass rating problem into binary divisions. According to previous researches, this solution may be more appropriate than the typical multiclass to binary divisions approaches One-vs-All and One-vs-One, since Nested Dichotomies algorithm exceeds some multiclass classification models for some datasets [7].

The proposed algorithm is able to divide the classification process so that, in the first split, reviews are already separated as good or bad, tailored to a specific domain. The binary division is more thoroughly analyzed in sentiment analysis works, where opinions are usually grouped in three classes as not recommended and two as recommended. As such, well-known binary division algorithms evaluated in previous studies may be used to build the final classification model without need of adaption to the multiclass problem.

Furthermore, we propose a new stage in the process of defining a final model based on supervised machine learning algorithms that can be used in any multiclass classification domain. In addition to all the common steps in the solution of rating-inference problem, we propose to carry out a domain study, in the form of user’s questionnaires or surveys, in order to understand what are the class groups more adequate to the users’ preferences and determine the best possible divisions.

In future work, this method can be tested with different datasets to measure performance in relation to other techniques and algorithms already used in the literature. In addition, this method can be used in ratings recommendation system. In this case, an online system could use the knowledge acquired from the overall users’ reviews and, at each stage, suggest the best division and indicate the reason for the choice of this division, pointing out which words were fundamental to the suggestion of such a division. This can be useful for a user that could know which words were more important in defining the rating for his review, helping to avoid misclassification.