
1 Introduction

The identification of opinions and their automatic classification in digital documents have been addressed in several papers since the beginning of the 2000s [2, 3]; these works belong to the Sentiment Analysis or Opinion Mining field. Aspect Based Sentiment Analysis (ABSA) is one of the Sentiment Analysis subtasks, and it allows extracting more information from opinions [2]. An aspect term refers to a possible characteristic of a product, service, event or person. In the ABSA subtask, an effective extraction of aspects is essential to achieve a correct sentiment classification [3].

Some methods perform aspect extraction in a single knowledge domain [7, 13]. These methods achieve good results in that domain, but their effectiveness decreases when they are applied to other domains. Several of these proposals use Deep Learning techniques [7, 13, 14] for aspect extraction, but they only deal with one or at most two domains. This limitation in the number of domains prevents the learning of features common to different datasets and, therefore, reduces the scope of these proposals.

During the learning process, it is important not to lose effectiveness in each knowledge domain while obtaining the patterns or features that may be common to several domains (e.g., the price aspect is common to the restaurant, hotel and electronic device domains). For this reason, a Lifelong Learning strategy is useful in aspect extraction: it takes advantage of the local learning of several domains by identifying the common features or patterns found in previous learning processes, without losing effectiveness when learning a new domain [3].

When using Lifelong Learning with neural networks, it is necessary to avoid catastrophic forgetting. This occurs when networks are trained sequentially on many tasks, because the network weights estimated for a task A can be modified in the learning process of a task B [10]. Some proposals have tried to overcome catastrophic forgetting, for instance, in image classification [10] and game strategy learning [9]. Nevertheless, there are few proposals devoted to solving catastrophic forgetting in ABSA subtasks [5].

In this paper, we propose a new model for extracting aspects in multiple domains based on the combination of Convolutional Neural Networks (CNN) [4] and Lifelong Learning [3]. The main contribution of this paper is to reduce catastrophic forgetting in a Lifelong Learning framework for aspect extraction. The rest of this paper is organized as follows. The principal aspect extraction methods based on Deep Learning techniques are presented in Sect. 2. Section 3 explains the proposed model for aspect extraction based on Deep and Lifelong Learning. Section 4 presents the evaluation of our model with respect to another state-of-the-art proposal. Finally, we provide concluding remarks and future research directions.

2 Related Work

There are several works which use Deep Learning for solving the ABSA subtask [2, 5, 7, 13]. Some researchers model the ABSA subtask as a sequence of features to be learned by a method. This point of view has motivated the use of Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU) for aspect extraction [5]. These models overcome the vanishing or exploding gradient problem during backpropagation. As an advantage, these networks capture long-term dependencies of context information, but they can be slower than other techniques such as CNN [5]. Usually, LSTM or GRU are combined with an Attention Mechanism (AM) [8], which improves the training of the neural network by relating the words closest to an aspect in a sentence [5]. However, we did not select LSTM, GRU or AM because there is no clear connection between them and our Lifelong Learning strategy.

Aspect extraction is also performed by using a CNN [7, 13], since it can extract salient n-gram features to create an informative latent semantic representation of sentences [4, 5]. Specifically, the approach presented in [13] achieves one of the best results in ABSA, with an F-measure of 0.86, but in only two separate domains [5]. It is important to note that this proposal includes several linguistic rules to label aspects in an unsupervised way, which improves the effectiveness of the Deep Learning methods, although their creation is expensive in time and human effort. Our Deep Learning proposal was inspired by [13] due to its successful results.

There are some proposals for solving the ABSA subtask using Continual or Lifelong Learning for multiple domains [5, 14], the approach presented in [14] being one of those that achieve the best results. Its authors created a Lifelong Learning proposal based on Conditional Random Fields (CRF). Its main disadvantages are the required feature engineering and the possible error propagation when learning other domains. Despite these disadvantages, we have selected it for comparison with our proposal, because it uses a Lifelong strategy and CRF has reported good results in ABSA [2, 14].

We explored the main approaches that deal with catastrophic forgetting to consider their ideas for extracting aspects in multiple domains. In [6], the authors apply fine-tuning with a reduced learning rate to prevent significant changes in the network parameters while training with new data. This solution avoids losing previous knowledge but limits new learning. Another approach applies feature extraction techniques, where the network parameters learned in a previous task are not changed and the output layer is used to extract features in a new task [10]. The Elastic Weight Consolidation (EWC) approach improves on other catastrophic forgetting proposals, but it is not capable of learning new categories incrementally [9].

The use of per-domain output layers in a Lifelong model is simpler and allows the lower layers to learn the common word features in multi-domain training. The following section explains how Deep and Lifelong Learning are amalgamated for extracting aspects and reducing catastrophic forgetting in multiple domains.

3 New Multi-domain Aspect Extraction Model Description

In this section, we introduce a new model for aspect extraction based on Deep and Lifelong Learning that reduces catastrophic forgetting in multi-domain settings. The Deep Learning architecture used is based on the CNN solution presented in [13], whereas the selected Lifelong Learning technique follows some features of the model proposed in [10].

Our model is composed of four main stages, as shown in Fig. 1. For each domain, an output layer connected to the last layer of the CNN is created to define the domain-dependent characteristics and parameters. The final stage offers a Deep Learning machine ready for use in diverse information systems.

Stage 1: Textual Representation. This stage receives the original textual opinions and returns the Word Embedding [11] vector for each word of each sentence, through the following steps (a minimal sketch is given after the list):

  • Pre-process textual opinions by applying a sentence splitter and a Part-of-Speech (POS) tagger.

  • Obtain the word vector model from the pre-trained Word Embeddings.
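The following is a minimal sketch of this stage, assuming NLTK for sentence splitting and POS tagging (the paper itself relies on the Stanford toolkit) and gensim for loading pre-trained Word2Vec vectors; the function and file names are illustrative only.

```python
# Minimal sketch of Stage 1 (assumptions: NLTK for splitting/tagging instead of
# the Stanford toolkit, gensim for the pre-trained Word2Vec vectors).
import nltk
from gensim.models import KeyedVectors

# One-time setup (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def represent_opinion(opinion_text, embeddings):
    """Split an opinion into sentences, POS-tag them, and attach word vectors."""
    represented = []
    for sentence in nltk.sent_tokenize(opinion_text):
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)                            # (word, POS) pairs
        vectors = [embeddings[w] if w in embeddings else None    # None marks OOV words
                   for w in tokens]
        represented.append(list(zip(tagged, vectors)))
    return represented

# embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors.bin', binary=True)
# sentences = represent_opinion("The battery life is great. The screen is dim.", embeddings)
```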

Stage 2: Basic Knowledge Extraction. This stage is in charge of learning the knowledge to be included in the Knowledge Base (KB), depending on the CNN training process for each current domain. The outputs of this stage are the new parameters obtained in the training process.

Stage 3: Knowledge Base Upgrade. In this stage, catastrophic forgetting is avoided through the analysis of the training process results. The losses or errors obtained in the current domain and the previous ones are evaluated, i.e., the results corresponding to the output layer associated with each domain are analyzed. The KB is enriched by the training process and the new aspects extracted from different domains.

Stage 4: Aspect Extractor Creation. This stage makes the Deep Learning machine available for solving the ABSA subtask in multiple domains. The final configuration of the CNN machine is obtained from the parameters in the KB that are common to all domains.

Fig. 1. Multi-domain aspect extraction model based on Deep and Lifelong Learning.

The seven-layer CNN architecture used in our approach is very similar to the model presented in [13]. After pre-processing the sentences, we append to the Word Embedding vector of each word in a sentence its corresponding POS tag. We use six basic POS tags (noun, verb, adjective, adverb, preposition, other) encoded as a six-dimensional binary vector. This binary vector is concatenated to the 300-dimensional vector of each Word Embedding obtained from a pre-trained Word2Vec model. Thus, the feature vector associated with each word is 306-dimensional. Consequently, for each training sentence, the input of the CNN model is a matrix of 82 rows and 306 columns, where the number of rows is the maximum number of words in the training sentences and the number of columns corresponds to the Word Embedding vector plus the six-dimensional POS tag binary vector.
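As a hypothetical illustration of this input representation, the snippet below builds the 82 x 306 sentence matrix from per-word 300-dimensional vectors and the six-group POS indicator; the group names and the padding scheme are our reading of the text, not the authors' exact implementation.

```python
# Sketch of the 82 x 306 input matrix: a 300-d Word2Vec vector concatenated
# with a 6-d binary POS indicator, padded/truncated to 82 rows per sentence.
import numpy as np

POS_GROUPS = ['noun', 'verb', 'adjective', 'adverb', 'preposition', 'other']
MAX_LEN, EMB_DIM = 82, 300

def pos_one_hot(tag_group):
    vec = np.zeros(len(POS_GROUPS), dtype=np.float32)
    vec[POS_GROUPS.index(tag_group)] = 1.0
    return vec

def sentence_matrix(word_vectors, tag_groups):
    """word_vectors: list of 300-d arrays; tag_groups: one POS_GROUPS entry per word."""
    rows = [np.concatenate([wv, pos_one_hot(tg)])
            for wv, tg in zip(word_vectors, tag_groups)]
    matrix = np.zeros((MAX_LEN, EMB_DIM + len(POS_GROUPS)), dtype=np.float32)
    n = min(len(rows), MAX_LEN)
    matrix[:n] = np.stack(rows[:n])          # zero-padding for short sentences
    return matrix                            # shape (82, 306)
```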

The second layer in our CNN architecture is a convolutional layer with 100 feature maps and a filter size of two; its output is computed using the hyperbolic tangent function. The third and fifth layers are max-pooling layers with a pool size of two. The fourth layer is a convolutional layer with 50 feature maps and a filter size of three; its output is also computed through a hyperbolic tangent function. The sixth layer is a fully connected layer. In our proposal, each domain defines its own fully connected seventh (output) layer. The stride of each convolutional layer is one, so that the relation between adjacent words is considered [13].

We use dropout regularization in the penultimate layer and an \(L_{2}\) constraint when computing the weight vector. Besides, we exploit the combination of convolution and max-pooling layers, as well as other neural network parameters, as defined in most CNN approaches [4, 5, 7]. Other architectures did not show significantly different results in our model.
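The sketch below is a literal tf.keras reading of the layer description above (filter counts, kernel sizes, tanh activations, max-pooling, dropout and the \(L_{2}\) constraint follow the text); the dense layer size, dropout rate, \(L_{2}\) factor and the sentence-level shape of the per-domain output head are our assumptions, since the mapping from this head to the per-word tag scores decoded with Viterbi is not fully specified here.

```python
# A sketch of the seven-layer CNN under stated assumptions (not the authors' code).
import tensorflow as tf

def build_domain_cnn(num_tags, hidden_units=100, dropout_rate=0.5):
    inputs = tf.keras.Input(shape=(82, 306))                       # layer 1: sentence matrix
    x = tf.keras.layers.Conv1D(100, 2, strides=1, padding='same',
                               activation='tanh')(inputs)          # layer 2: 100 maps, filter 2
    x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)               # layer 3: max-pooling
    x = tf.keras.layers.Conv1D(50, 3, strides=1, padding='same',
                               activation='tanh')(x)               # layer 4: 50 maps, filter 3
    x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)               # layer 5: max-pooling
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(hidden_units, activation='tanh')(x)  # layer 6: fully connected
    x = tf.keras.layers.Dropout(dropout_rate)(x)                   # dropout on penultimate layer
    outputs = tf.keras.layers.Dense(
        num_tags,
        kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)      # layer 7: per-domain output
    return tf.keras.Model(inputs, outputs)

# model = build_domain_cnn(num_tags=3)   # e.g., BIO-style aspect tags (assumption)
```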

In the learning process of our CNN model, a window of five words is taken to represent each word in a sentence (i.e., the analyzed word and two words on each side). This strategy responds to the possible relationship between aspect terms and the words closest to them. Other window sizes did not increase the accuracy. The error estimation is made by applying the Viterbi algorithm [15] to the output layer of each domain.
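For illustration, a minimal window extractor consistent with this description is shown below; padding at the sentence borders is our assumption.

```python
# Five-word context windows: the analyzed word plus two words on each side.
def context_windows(tokens, size=2, pad='<PAD>'):
    padded = [pad] * size + tokens + [pad] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(tokens))]

# context_windows(['the', 'battery', 'life', 'is', 'great'])[1]
# -> ['<PAD>', 'the', 'battery', 'life', 'is']
```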

Applying Deep Learning models in ABSA requires a large amount of aspect-sentiment labeled data. Nevertheless, such labeled data are often scarce [5]: the dataset annotation process is very expensive and time-consuming, and the existing labeled datasets do not contain the needed amount of information. To obtain the required amount of labeled data, we apply an automatic aspect sentiment labeling strategy based on the linguistic rules defined in [13]. These rules identify a possible aspect in a sentence by using the syntactic relations between the words in the sentence and the opinion words present in the SenticNet resource. The syntactic relations were determined with the Stanford Dependency Parser.
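As a simplified illustration of this kind of rule, the sketch below uses spaCy in place of the Stanford Dependency Parser and a toy word list in place of SenticNet; the two rules shown are representative examples, not the exact rule set of [13].

```python
# Illustrative aspect-labeling rules (spaCy instead of the Stanford parser,
# a toy lexicon instead of SenticNet).
import spacy

nlp = spacy.load('en_core_web_sm')
OPINION_WORDS = {'great', 'poor', 'excellent', 'terrible', 'good', 'bad'}

def candidate_aspects(sentence):
    aspects = set()
    for token in nlp(sentence):
        # Rule A: an opinion adjective modifying a noun marks that noun as an aspect.
        if token.dep_ == 'amod' and token.lemma_.lower() in OPINION_WORDS:
            aspects.add(token.head.lemma_.lower())
        # Rule B: "X is ADJ" -- the nominal subject of an opinion word is an aspect.
        if token.dep_ == 'nsubj' and token.head.lemma_.lower() in OPINION_WORDS:
            aspects.add(token.lemma_.lower())
    return aspects

# candidate_aspects("The battery life is great but the screen is terrible.")
# may yield {'life', 'screen'}
```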

Finally, we present a new Lifelong Learning method to avoid catastrophic forgetting, called Learning without Forgetting with Linguistic Rules (Lwf-CNN-lgR). Our proposal is inspired by the approach proposed in [10], because it was successfully used to avoid catastrophic forgetting during a CNN learning process. Since the approach proposed in [10] was applied to multi-class image classification, we had to adapt it to label the words in the sentences. We define the loss function as the cross-entropy between the Viterbi algorithm output and the correct sequence of tags in a sentence. The main goal of this adaptation is to achieve a more effective ABSA learning. Besides, a set of linguistic rules was added to obtain more aspects from unlabeled datasets and enrich the Knowledge Base.

Lwf-CNN-lgR is shown in Algorithm 1, where the subindex s indicates the parameters shared by all domains (CNN model), the subindex c is related to the specific parameters of the current domain, and the subindex p is associated with the specific parameters of the previous domains. The constant \(\lambda _{prev}\) is a penalty that controls the influence of the loss in the previous domains. The \({{\varvec{Loss}}}_{prev}()\) and \({{\varvec{Loss}}}_{cnt}()\) functions represent the losses of the previous and current domains at each training moment. The \({{\varvec{R}}}()\) function is responsible for adjusting the neural network regularization. \(\varTheta _{s}\) is the set of parameters shared across all domains (the weights of our CNN model), \(\varTheta _{p}\) is the set of parameters learned specifically from previous domains, and \(\varTheta _{c}\) is the set of randomly initialized parameters of the current domain (the weights of the output layer of each current domain).

Algorithm 1. Lwf-CNN-lgR (Learning without Forgetting with Linguistic Rules).

In Algorithm 1, the parameters corresponding to the current domain output layer are randomly initialized, as shown in line 1. The parameters of each output layer from previous domains are stored. The CNN model is trained using each sentence in the dataset, and outputs are obtained for each previous domain and the current one, as shown in lines 3 and 4.

Line 5 shows how the loss values of the current domain and the previous ones are jointly minimized. This process updates the parameters of the neural network by means of regularization and gradient descent. Combining the losses of the current and previous domains makes it possible to avoid catastrophic forgetting: a high value of the \(\lambda _{prev}\) constant gives more influence to the loss values of the previous domains. The two final results are the trained CNN model and the output layer of the last domain, which contains the common knowledge among all output layers. These two results are stored in the KB.
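A hedged sketch of this update step is given below in tf.keras notation; `shared_cnn` plays the role of \(\varTheta _{s}\), `prev_heads` of \(\varTheta _{p}\) and `current_head` of \(\varTheta _{c}\). Following the Learning-without-Forgetting idea of [10], the previous-domain loss keeps each old output layer close to its recorded responses on the current data; the cross-entropy stands in for the Viterbi-based loss, and all names and choices are illustrative, not the authors' implementation.

```python
# Sketch of one Lwf-CNN-lgR training step (assumptions: recorded old-head
# responses as soft targets, KL divergence for Loss_prev, SGD updates;
# the additional R() regularization term of Algorithm 1 is omitted here).
import tensorflow as tf

LAMBDA_PREV = 0.0056
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
xent = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
kld = tf.keras.losses.KLDivergence()

def train_step(shared_cnn, current_head, prev_heads, prev_targets, x, y):
    with tf.GradientTape() as tape:
        features = shared_cnn(x, training=True)
        # Loss_cnt: error of the current-domain output layer on the labeled batch.
        loss_cnt = xent(y, current_head(features))
        # Loss_prev: keep previous-domain outputs close to their recorded responses.
        loss_prev = 0.0
        for head, target_logits in zip(prev_heads, prev_targets):
            loss_prev += kld(tf.nn.softmax(target_logits),
                             tf.nn.softmax(head(features)))
        loss = loss_cnt + LAMBDA_PREV * loss_prev      # combined objective (line 5)
    variables = shared_cnn.trainable_variables + current_head.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```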

Our proposal requires tools able to provide the grammatical structure of the text. By analyzing the state of the art, we noticed that one of the most prominent libraries is the Stanford Dependency Parser, which also provides a POS tagger and a sentence splitter. Additionally, the model was trained using the Tensorflow framework. The trained model can be used for analyzing opinions on public services or products, as a service to third parties or as part of an information retrieval module in areas such as electronic government or business intelligence.

4 Experimental Results

We selected the seven datasets used in [14] to evaluate our proposal. Besides, two more datasets, about restaurants [12] and hotel reviews from TripAdvisor, are included to explore in depth the performance of our model in diverse domains, not only electronic devices. The linguistic rules were applied to the unlabeled datasets used in [1] to increase the training data. This collection provides 1000 reviews tracked from Amazon for each of 50 domains about electronic devices such as keyboards, car stereos, tablets, etc. We designed the experiments to compare the performance of our approach with the state-of-the-art models published in [2, 14]. We name the compared proposals as follows:

  • Lifelong CRF: The approach presented in [14] which uses a CRF model in a Lifelong learning scheme.

  • CRF: A linear chain CRF model evaluated in a multitasking learning scheme to consider all domains at the same time [2].

  • Lwf-CNN: Our proposal using only CNN and Lifelong Learning without linguistic rules, i.e., the unlabeled datasets are not used.

  • Lwf-CNN-lgR: Our proposal by combining CNN and the Lifelong Learning scheme enriched with linguistic rules.

For the CNN model, we use the pre-trained Google News Skip-gram model as Word Embeddings [11]. Other model parameters are randomly initialized from a uniform distribution U(−0.05, 0.05). The learning rate used was 0.01 (other values did not show better performance) with the Gradient Descent Optimizer, and the number of epochs was set to 100. Batch training was not used in order to obtain more effective results. The \(\lambda _{prev}\) value is set to 0.0056 to control the influence of the previous learning on the current learning.
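In tf.keras terms, these settings correspond roughly to the configuration below; the names are ours and the scope of the uniform initializer is an assumption.

```python
# Reported training configuration, sketched in tf.keras terms (not the authors' code).
import tensorflow as tf

initializer = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)  # U(-0.05, 0.05)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)                        # Gradient Descent Optimizer
EPOCHS = 100
BATCH_SIZE = 1            # no batch training: one sentence per update
LAMBDA_PREV = 0.0056      # influence of the previous-domain loss
```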

The selected evaluation measures are precision (P), recall (R), and F1-score (F1), because they are widely used in ABSA [2, 5] and were applied in [14]. We conducted both cross-domain and in-domain tests. Our problem setting is cross-domain; nevertheless, in-domain tests are included for completeness, as done in [14].

The cross-domain experiments combine six labeled domain datasets for training and test on the seventh remaining domain (not used in training). The in-domain experiments train and test on the same six domains, excluding one of the seven. Figure 2 shows the F1-score results corresponding to the cross-domain Deep and Lifelong Learning evaluation, whereas Fig. 3 shows those corresponding to the in-domain experiments. In Fig. 2, each domain on the x-axis is the one excluded from training, while in Fig. 3 it indicates that the other six domains were used in both training and testing (thus in-domain).

The best F1 results are obtained by the Lwf-CNN-lgR model, as shown in Fig. 2. The Lwf-CNN-lgR model benefits from the word overlap across similar domains, as well as from the unlabeled datasets associated with the electronic device domain used at the beginning of the training process, which explains the achieved results. In the in-domain experiments, the Lwf-CNN-lgR model is mostly the winner, although in some domains it is surpassed by Lwf-CNN, as shown in Fig. 3.

Fig. 2. Cross-domain F1-score results of the Deep and Lifelong Learning evaluation.

Fig. 3. In-domain F1-score results of the Deep and Lifelong Learning evaluation.

As mentioned before, we include two new datasets (restaurant and hotel reviews) to analyze in depth how our proposal behaves when the domains are more diverse. The Lwf-CNN and Lwf-CNN-lgR models were then evaluated in a cross-domain scheme using nine domains in total, i.e., the seven previous ones and the two newly added ones. As shown in Table 1, the precision, recall, and F1-score values are below the average results achieved in the previous experiments, which shows that our proposal is still sensitive to very diverse domains. The low values are caused by the presence of new aspects in the restaurant and hotel review datasets that do not appear in the datasets associated with electronic device domains.

Table 1. Cross-domain Deep and Lifelong Learning evaluation results on the restaurant and hotel review datasets.

Two semantically close domains (DVD player and MP3 player) and two less close ones (Computer and DVD player) were selected to evaluate how catastrophic forgetting is overcome when training a new domain. In this experiment, the model was first trained on the old domain and then on the new one. After training on the new domain, testing was executed for each domain. The results show that it is possible to maintain acceptable performance during each domain training, as shown in Fig. 4. However, the best result is obtained when our proposal is tested on semantically close domains.

Fig. 4. Experimental results on overcoming catastrophic forgetting.

In general, our proposal outperforms the results obtained in [14]. The use of linguistic rules to increase the training data improves the results of Deep Learning methods when labeled data are scarce. The deep CNN, which is non-linear in nature, improves on the CRF model results. The main advantage of our framework is that it does not need any feature engineering in the Lifelong Learning model. Our proposal highlights the importance of combining a model trained in a supervised way with linguistic patterns in the ABSA task.

5 Conclusions

In this work, the multi-domain aspect extraction subtask was addressed by combining CNN and Lifelong Learning. Our model reduces catastrophic forgetting in the multi-domain context. The achieved results improve on one of the most significant state-of-the-art proposals. The combination of CNN and Lifelong Learning techniques in the ABSA subtask constitutes a novel proposal in the Sentiment Analysis research field. Although the obtained results are promising, future work will be oriented to testing other algorithms to avoid catastrophic forgetting and to evaluating the Lifelong Learning model with other Deep Learning techniques.