1 Introduction

E-commerce has become an important part of daily life, and more and more people choose to purchase products online. According to recent studies (Fullerton 2017; Kats 2018), most online shoppers browse reviews before making decisions, so it is essential for users to be able to find reliable, high-quality reviews. To this end, several websites have implemented a voting mechanism that allows users to give feedback on online reviews. However, users may not yet have provided feedback on the initial reviews of a new product, and for older products, recently posted reviews may receive no votes due to their low exposure. Therefore, an automatic helpfulness evaluation mechanism is in high demand to help users evaluate these reviews.

Previous works typically derived useful information from different sources, such as review content (Hong et al. 2012; Martin and Pu 2014; Yang et al. 2015), metadata (Fan et al. 2019; Martin and Pu 2014; Mudambi and Schuff 2010), and context (Lu et al. 2010; O’Mahony and Smyth 2009; Tang et al. 2013). However, such features were extracted from each source independently, without considering possible interactions. In particular, previous approaches do not take into account the process by which customers evaluate reviews. A customer’s perception of the helpful information in a review is affected by the sentiment of the review and what the customer already knows about the product. Before reading a review text for a product, the customer is very likely to be aware of background information such as the star rating, product attributes, etc.

When a customer reads a review, the customer’s expectations may be affected by the sentiment of the review. If the review gives a low star rating, the customer may hold a negative opinion towards the item at first and mainly look into those aspects of the review that support the low star rating. Consider the following example:

I loved the simplicity of the mouse, ...and it was very comfortable ...About 4 months of owning the mouse the scroll wheel seemed to be in always clicked in position, and would only stop after clicking it down hard for a couple seconds. I’m very disappointed with the quality of the mouse. ...

The above review has a star rating of 2 out of 5. For a review with an overall negative sentiment like this, we may pay more attention to its descriptions of the bad aspects of the product (text in italics) than to the good aspects. Therefore, each word/sentence may contribute unequally to the helpfulness of a review, depending on the review’s sentiment. Although review sentiment has been explored previously (Huang et al. 2015; Martin and Pu 2014; Mudambi and Schuff 2010), prior works have not used review sentiment to identify useful information in review text.

In addition, the customer likely has some preconceptions about the product features they are most interested in. With these expectations in mind, the customer pays special attention to those aspects of the review text that they find most salient. For example, in a review of a computer mouse, we may expect to see comments related to attributes such as hand feel, ease of use, the scroll wheel, and so on. Such attributes are considered helpful and garner more attention. Moreover, although the attributes that customers are interested in may be quite similar, the degree of importance of these attributes may vary from product to product. In the above review, for example, the scroll wheel may be the most salient feature of the mouse. There have been earlier efforts (Fan et al. 2019; Hong et al. 2012; Liu et al. 2007) at capturing useful information from a review by considering product information. However, the unique aspects of each product (different levels of importance of attributes, evaluation standards, etc.) were not fully identified in those efforts.

Our research focus is on evaluating the helpfulness of online reviews. We explore two tasks: first, given a review, we evaluate whether it is helpful; second, for each product, we recommend the top n helpful reviews to users. As described above, we have insights for improving review helpfulness evaluation from two perspectives. Therefore, we address the following two research questions: (a) Can the sentiment of a review be used to identify the helpful information in a review and improve review helpfulness evaluation? (b) Can product-related attributes, especially the unique attributes of each product, be used to identify helpful information in a review and improve review helpfulness evaluation?

In this paper, we explore these research questions and address design and performance issues in previous approaches to evaluating the helpfulness of online reviews. We propose a novel neural network architecture that introduces sentiment and product information when identifying helpful content in a review text. First, we use a hierarchical bi-directional LSTM to generate sentence-level and review-level representations. Then we augment the model with two attention layers to encode the sentiment and product information, respectively, into the review representation. The sentiment attention layer captures the sentiment-influenced importance of each word/sentence in the review. The product attention layer is designed to capture important attributes of a review from both related products and the particular product under consideration. We combine the review representations learned from the two attention layers with the expectation that these representations will behave synergistically. This study extends the work in Qu et al. (2020); the main contributions are summarized as follows:

  • To our knowledge, we are the first to propose that customers may have different expectations for reviews that express different sentiments. We design a sentiment attention layer to model sentiment-driven changes in user focus on a review.

  • We propose a novel product attention layer. The purpose of this layer is to automatically identify the important product-related attributes from reviews. This layer fuses information not only from related products, but also from the specific product.

  • We evaluate the performance of our model on two real-world data sets: the Amazon data set and the Yelp data set. We consider two application scenarios: cold start and warm start. In the cold start scenario, our model demonstrates an AUC improvement of 5.4% and 1.5% on the Amazon and Yelp data sets, respectively, when compared to the state-of-the-art model. We also validate the effectiveness of each of the attention layers of our proposed model in both scenarios.

  • In addition, we evaluate the performance of our model from the perspective of recommendations based on three metrics: NDCG@10, Precision@10 and Recall@10. Our model outperforms the state-of-the-art model PRH-Net designed by Fan et al. (2019) on all three of these metrics.

2 Related work

Previous studies have concentrated on mining useful features from the content (i. e., the review itself) and/or the context (other sources such as reviewer or user information) of the reviews (Hong et al. 2012; Kim et al. 2006; Liu et al. 2007; Martin and Pu 2014; Mukherjee et al. 2017; Ocampo Diaz and Ng 2018; O’Mahony and Smyth 2009; Tang et al. 2013; Xiong and Litman 2011; Yang et al. 2015).


Content features have been extracted and widely utilized. They can be roughly broken down into the following categories: structural features, lexical features, syntactic features, emotional features, semantic features, and argument features (Hong et al. 2012; Kim et al. 2006; Liu et al. 2017, 2007; Martin and Pu 2014; Mukherjee et al. 2017; Xiong and Litman 2011; Yang et al. 2015). Structural features include the number of tokens and sentences, the percentage of question sentences, the star rating, and so on. They are related to structural properties and are used to reveal a user’s attitude towards a product. Lexical features, including unigrams and bigrams, are weighted by tf-idf to represent a text. Syntactic features, such as the number/percentage of verbs and nouns in a review, are used to capture linguistic properties. Emotional features are usually based on the 20 emotion categories of the Geneva Affect Label Coder (GALC) dictionary: the frequency of each emotion category and the number of non-emotional words are counted as emotional features. For semantic features, researchers leveraged the existing linguistic dictionary INQUIRER to represent a review in semantic dimensions (Yang et al. 2015). For argument features, researchers focused on the argumentative sentences in a review and examined them from different perspectives like component, token, letter, and position (Liu et al. 2017).

Prior works have generally investigated one or more content features. For instance, Kim et al. (2006) investigated a variety of content features from Amazon product reviews, and found that features such as review length, unigrams and product ratings are most useful in measuring review helpfulness. Yang et al. (2015) mainly exploited two semantic features (i. e., Linguistic Inquiry and Word Count, and General Inquirer) to analyze and predict helpful reviews. Martin and Pu (2014) proposed that emotional words play an important role in predicting the helpfulness of review text. They extracted emotion from reviews by making use of GALC, a general lexicon of emotional words associated with a model representing 20 different categories, and their results show that emotion-based methods outperform previous structure-based approaches.


Context features have also been studied to improve helpfulness prediction (Lu et al. 2010; O’Mahony and Smyth 2009; Tang et al. 2013). For example, O’Mahony and Smyth (2009) combined features mined from the reviewer and the wider community reviewing activity, and features derived from the review text. Lu et al. (2010) examined social context that may reveal the quality of reviewers to enhance the prediction of the quality of reviews. Tang et al. (2013) identified the context information from the aspects of reviewers, raters and their relationship, and designed a context-aware model to predict review helpfulness. While context information shows promise for improving helpfulness prediction, it may not be available across different platforms and is not appropriate for designing a universal model.


Deep neural networks have recently been proposed for helpfulness prediction of online reviews (Chen et al. 2018, 2019; Fan et al. 2018, 2019; Qu et al. 2018). To tackle the problem of insufficient labeled data to build the review helpfulness model, Chen et al. (2018) proposed a model with a transfer learning module to adapt domain knowledge. The shared and domain-specific features are maintained separately by introducing adversarial and domain discrimination losses. Chen et al. (2019) designed a word-level gating mechanism to represent the relative importance of each word. Fan et al. (2018) proposed a multi-task paradigm to predict the star ratings of reviews and to identify the helpful reviews more accurately. They also utilized the metadata of the target product in addition to the textual content of a review to better represent a review (Fan et al. 2019).


Available data sets Most prior research has utilized data sets constructed from Amazon product reviews (Chen et al. 2018, 2019; Hong et al. 2012; Kim et al. 2006; Liu et al. 2007; Mukherjee et al. 2017; Yang et al. 2015). The data set size varies from \(\sim \)23K reviews of one product category (Liu et al. 2007) to \(\sim \)2.9M reviews of five product categories (Chen et al. 2018). Some researchers also considered multiple data sources. Martin and Pu (2014) adopted three data sets collected from Amazon, Yelp, and TripAdvisor, respectively. But the data set from TripAdvisor is relatively small, containing only \(\sim \)68K reviews. Tang et al. (2013) and Lu et al. (2010) constructed data sets from Ciao, a popular product review site. In contrast to Amazon, which allows users to give a binary vote for review helpfulness, Ciao supports scores ranging from 0 to 5 to indicate the helpfulness of a review. However, the Ciao data sets that they used are not publicly available. Fan et al. (2019) used two large-scale data sets: \(\sim \)23.8M reviews from 9 Amazon product categories and \(\sim \)2.6M reviews from 5 Yelp product categories. We employ the same data sets in our work as Fan et al. (2019).

The methods summarized above are representative of the research progress in review helpfulness prediction. Sentiment and product information have been explored previously (Fan et al. 2019; Huang et al. 2015; Kim et al. 2006; Martin and Pu 2014). With respect to sentiment, Martin and Pu (2014) extracted emotional words from review text to serve as important parameters for helpfulness prediction. Huang et al. (2015) found that the sentiment of a review is positively correlated with review helpfulness. However, previous research has not taken into account differences in customer expectations that can result from review sentiment perception. With respect to product information, Fan et al. (2019) tried to better represent the salient information in reviews by considering the metadata information (title, categories) of the target product. However, this information can be quite similar for products of the same type, so the unique aspects of each product (different degrees of importance of attributes, evaluation standards, etc.) cannot be fully captured from reviews. Wu et al. (2018) proposed an architecture that is superficially similar to ours in the sense that both architectures are based on LSTM networks and attention layers. They utilized a user attention layer and a product attention layer to capture sentiment-related information. In contrast, we design a sentiment attention layer and a product attention layer to identify the helpful information in a review text. Consequently, the internal design of our attention layers is different from theirs, as they serve completely different purposes.

Fig. 1 The architecture of HSAPA

3 Our proposed model

Our model is shown in Fig. 1. It is built upon a hierarchical bi-directional LSTM, which is a standard model for document understanding and classification (Liu and Guo 2019; Yang et al. 2016). We propose two novel attention layers that incorporate sentiment and product information in order to improve review representations. The product attention layer is designed by fusing information from both the target product and related products. The sentiment information is also encoded, through the attention mechanism, to capture helpful information from a review. After applying the two attention layers separately, we combine the two resulting review representations to predict review helpfulness, so that the two sources of information act jointly. As the main components of our model are the Hierarchical bi-directional LSTM, the Sentiment Attention layer, and the Product Attention layer, we refer to our model as HSAPA.

3.1 Hierarchical bi-directional LSTM

Our proposed model is based on a hierarchical bi-directional LSTM. A bi-directional LSTM model is able to learn past and future dependencies. This provides a better understanding of context (Melamud et al. 2016). The hierarchical architecture includes two levels: the word level and the sentence level. These levels learn dependencies between words and sentences, respectively.


Word encoder A bi-directional LSTM consists of two LSTM networks that process data in opposite directions. At the word level, we feed the embedding of each word into a unit of both LSTMs, and get two hidden states. We then concatenate these two hidden states as a representation of a word. The process is defined as:

$$\begin{aligned}&\overleftarrow{h}_{ij} = \overleftarrow{LSTM}(x_{ij}) \end{aligned}$$
(1)
$$\begin{aligned}&\overrightarrow{h}_{ij} = \overrightarrow{LSTM}(x_{ij}) \end{aligned}$$
(2)
$$\begin{aligned}&h_{ij} = [ \overleftarrow{h}_{ij}, \overrightarrow{h}_{ij}] \end{aligned}$$
(3)

where \(x_{ij}\) is the embedding vector of the ith word of the jth sentence. \(\overrightarrow{h}_{ij}\) and \(\overleftarrow{h}_{ij}\) are hidden states learned from bi-directional LSTM. The state \(h_{ij}\) is the concatenation of these hidden states for the word \(x_{ij}\).


Sentence encoder At the sentence level, a sentence representation is learned through an architecture similar to that used for the word level:

$$\begin{aligned}&\overleftarrow{h}_{j} = \overleftarrow{LSTM}(s_{j}) \end{aligned}$$
(4)
$$\begin{aligned}&\overrightarrow{h}_{j} = \overrightarrow{LSTM}(s_{j}) \end{aligned}$$
(5)
$$\begin{aligned}&h_{j} = [ \overleftarrow{h}_{j}, \overrightarrow{h}_{j}] \end{aligned}$$
(6)

where \(s_j\) refers to the weighted representation of the jth sentence after applying the attention layer. The state \(h_{j}\), the concatenation of the hidden states \(\overrightarrow{h}_{j}\) and \(\overleftarrow{h}_{j}\), is the final representation of the sentence \(s_j\).
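
To make the two-level encoder concrete, the following is a minimal PyTorch sketch of a hierarchical bi-directional LSTM corresponding to Eqs. 1–6. It is an illustration under simplifying assumptions (fixed numbers of sentences and words per review, and a plain mean in place of the attention-weighted sentence input \(s_j\)); all module names and dimensions are ours, not the implementation used in our experiments.

```python
import torch
import torch.nn as nn

class HierBiLSTM(nn.Module):
    """Minimal hierarchical bi-directional LSTM (Eqs. 1-6), for illustration only."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Word-level Bi-LSTM: concatenated forward/backward states give 2*hidden_dim per word.
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Sentence-level Bi-LSTM operates on (attention-weighted) sentence vectors s_j.
        self.sent_lstm = nn.LSTM(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, reviews):
        # reviews: (batch, m sentences, l words) of word indices
        b, m, l = reviews.shape
        x = self.embed(reviews.reshape(b * m, l))        # (b*m, l, embed_dim)
        h_words, _ = self.word_lstm(x)                   # h_ij = [h<-_ij, h->_ij]
        # Placeholder for the attention-weighted sum of Eq. 9: here a plain mean over words.
        s = h_words.mean(dim=1).reshape(b, m, -1)        # (b, m, 2*hidden_dim)
        h_sents, _ = self.sent_lstm(s)                   # h_j, (b, m, 2*hidden_dim)
        return h_words.reshape(b, m, l, -1), h_sents

if __name__ == "__main__":
    model = HierBiLSTM(vocab_size=5000)
    dummy = torch.randint(0, 5000, (2, 4, 12))           # 2 reviews, 4 sentences, 12 words each
    h_words, h_sents = model(dummy)
    print(h_words.shape, h_sents.shape)                  # (2, 4, 12, 256) (2, 4, 256)
```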

3.2 Sentiment attention layer

For reviews that express different types of sentiment (positive, negative, etc.), customers may have different expectations, and attend to different words or sentences of a review. In order to learn the sentiment-influenced importance of each word/sentence, we propose a custom attention layer.

In this attention layer, we use an embedded vector to represent each type of sentiment. We use the star rating of each review to indicate its sentiment, and map each discrete star rating into a real-valued, continuous vector Sent. For example, for Amazon reviews, a reader can give a star rating ranging from 1 to 5; in this case, there are five vectors representing the five different types of sentiment. Each vector is initialized randomly and updated gradually during training by reviews with the corresponding star rating. Sent can be interpreted as a high-level representation of the sentiment-specific information. We measure the similarity between the sentiment and each word/sentence using a score function, defined as:

$$\begin{aligned} f(Sent, h_{ij}^{s}) = (v_{w}^s)^T\tanh (W_{wh}^{s}h_{ij}^{s} + W_{ws}^{s}Sent + b_{w}^s) , \end{aligned}$$
(7)

where \(v_{w}^{s}\) is a weight vector, and \((v_{w}^{s})^T\) indicates its transpose, \(W_{wh}^{s}\) and \(W_{ws}^{s}\) are weight matrices, and \(b_{w}^{s}\) is the bias vector. At the word level, the input to the score function is the abstract sentiment representation Sent and the hidden state of the ith word in the jth sentence \(h_{ij}^{s}\). Next, we use the softmax function to normalize the scores to get the attention weights:

$$\begin{aligned} \alpha _{ij}^{s} = \frac{\exp (f(Sent, h_{ij}^{s}))}{\sum _{k=1}^{l}\exp (f(Sent, h_{kj}^{s}))}, \end{aligned}$$
(8)

\(\alpha _{ij}^{s}\) is the attention weight for the word representation \(h_{ij}^{s}\).

The sentence representation is a weighted aggregation of word representations; the jth sentence is represented as in Eq. 9, where the number of words in the jth sentence is denoted by l. The representation of a review is likewise a weighted combination of sentence representations, defined in Eq. 10, where \(h_{j}^{s}\) is the hidden state of the jth sentence \(s_{j}^{s}\), learned through the bi-directional LSTM, and m refers to the number of sentences in a review.

$$\begin{aligned}&s_{j}^{s} = \sum _{i=1}^{l} \alpha _{ij}^{s} h_{ij}^{s} . \end{aligned}$$
(9)
$$\begin{aligned}&r^{s} = \sum _{j=1}^{m} \beta _{j}^{s} h_{j}^{s} . \end{aligned}$$
(10)

The value \(\beta _{j}^{s}\) indicates the corresponding attention score for \(h_{j}^{s}\). The weight score \(\beta _{j}^{s}\) is calculated based on the score function f(.) defined as:

$$\begin{aligned}&f(Sent, h_{j}^{s}) = (v_{s}^s)^T\tanh (W_{sh}^{s}h_{j}^{s} + W_{ss}^{s}Sent + b_{s}^{s}) , \end{aligned}$$
(11)
$$\begin{aligned}&\beta _{j}^{s} = \frac{\exp (f(Sent, h_{j}^{s}))}{\sum _{k=1}^{m}\exp (f(Sent, h_{k}^s))}. \end{aligned}$$
(12)
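
The following is a minimal sketch of the word-level sentiment attention (Eqs. 7–9); the sentence-level weights (Eqs. 11–12) follow the same pattern with separate parameters. The class name, dimensions, and zero-based rating indexing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentimentAttention(nn.Module):
    """Word-level sentiment attention (Eqs. 7-9); the sentence level is analogous (Eqs. 11-12)."""
    def __init__(self, hidden_dim, sent_dim, num_ratings=5):
        super().__init__()
        self.sent_embed = nn.Embedding(num_ratings, sent_dim)   # one Sent vector per star rating
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_s = nn.Linear(sent_dim, hidden_dim)              # its bias plays the role of b_w
        self.v = nn.Linear(hidden_dim, 1, bias=False)           # v_w^T

    def forward(self, h_words, rating):
        # h_words: (batch, l, hidden_dim); rating: (batch,) star rating, 0-indexed
        sent = self.sent_embed(rating).unsqueeze(1)             # (batch, 1, sent_dim)
        scores = self.v(torch.tanh(self.W_h(h_words) + self.W_s(sent)))   # Eq. 7
        alpha = torch.softmax(scores, dim=1)                    # Eq. 8
        s_j = (alpha * h_words).sum(dim=1)                      # Eq. 9
        return s_j, alpha.squeeze(-1)

if __name__ == "__main__":
    attn = SentimentAttention(hidden_dim=256, sent_dim=64)
    h = torch.randn(3, 12, 256)                                  # 3 sentences, 12 words each
    stars = torch.tensor([0, 2, 4])                              # star ratings 1, 3, 5 (0-indexed)
    s, alpha = attn(h, stars)
    print(s.shape, alpha.shape)                                  # (3, 256) (3, 12)
```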

3.3 Product attention layer

As shown in the top right corner of Fig. 1, the Product Attention Layer consists of two components: related product information and unique product information. Metadata information is embedded and fed into a CNN model (Kim 2014) to capture the related product information, and the product identifier is encoded to represent the unique product information.

3.3.1 Related product information

When reading a review, customers may focus on different attributes depending on the product the review references. We take advantage of the metadata information (such as title, product description, product category, etc.) of each product to learn common attributes shared by related products. Consider the following product description of a computer mouse:

Ergonomic shape - Ergonomically shaped design and soft rubber grips conform to your hand ...Interface - USB receiver...

Convenient controls - Easy-to-reach ...

Micro-precise scroll wheel - With more grooves per millimeter...

Long battery life - 3 year battery life ...

From this description, we want to learn common product related attributes such as “shape”, “interface”, “battery life”, “scroll wheel” etc. If these attributes appear in a review text, they may attract more customer attention.

In order to capture key information from the metadata, we make use of a CNN model (Kim 2014). This CNN model generalizes well for multiple NLP tasks such as text understanding (Zhang and LeCun 2015), document classification (Johnson and Zhang 2015; Severyn and Moschitti 2015; Zhang et al. 2015), etc. The CNN model can acquire important information from a text. Moreover, it has a relatively simple architecture and fewer parameters compared to other models such as LSTM, Bi-LSTM, etc., and requires less training time.

The CNN model consists of a convolution layer, a max-pooling layer, and a fully connected layer. In the convolution layer, each filter is applied to a window of words to generate the feature map. For example, we apply a filter \(w \in {\mathbb {R}}^{hk}\) to a window of words \(x_{i:i+h-1}\). Here k indicates the dimension of the word vector, and \(x_{i:i+h-1}\) refers to the concatenation of h words from \(x_i\) to \(x_{i+h-1}\). The context feature \(c_{ih}\) is generated as:

$$\begin{aligned} c_{ih} = \text {ReLU}(w x_{i:i+h-1} + b) , \end{aligned}$$
(13)

where b is the bias item.

We evaluated different approaches to initializing the word vector such as the pretrained Word2Vec embedding (Mikolov et al. 2013), the pretrained GloVe embedding (Pennington et al. 2014) and random initialized embedding, as well as different vector dimensions. The pretrained GloVe with 100 dimensions was able to achieve the best performance and required relatively less training time.

A feature map of the text is then generated as \(c_{h} = [c_{1h}, c_{2h},\ldots , c_{nh}]\), where \(c_{1h}, c_{2h}, \ldots , c_{nh}\) refer to the context features extracted from different sliding windows of the text, and \(c_{h}\) is the concatenation of these features. The feature map \(c_{h}\) is then fed into a max-pooling layer, and the maximum value \(c = \max \{c_{h}\}\) is extracted as the important information captured by a particular filter. A number of filters are used, and the extracted features are concatenated and fed into a fully connected layer to generate a vector \(Prod_{1}\). \(Prod_{1}\) is a representation of the important related product attributes in the metadata.
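
A minimal sketch of the metadata CNN that produces \(Prod_{1}\) (Eq. 13 followed by max-pooling and a fully connected layer); the filter sizes, filter counts, and output dimension here are illustrative assumptions rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class MetadataCNN(nn.Module):
    """Convolution + max-pooling + fully connected layer producing Prod_1 (Sect. 3.3.1)."""
    def __init__(self, embed_dim=100, num_filters=64, window_sizes=(2, 3, 4), out_dim=128):
        super().__init__()
        # One Conv1d per window size h; each filter spans h consecutive word embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), out_dim)

    def forward(self, meta_embeds):
        # meta_embeds: (batch, n words, embed_dim), e.g. pretrained GloVe vectors of the metadata text
        x = meta_embeds.transpose(1, 2)                      # (batch, embed_dim, n)
        feats = []
        for conv in self.convs:
            c_h = torch.relu(conv(x))                        # Eq. 13, all sliding windows at once
            feats.append(torch.max(c_h, dim=2).values)       # max-pooling: c = max{c_h}
        prod1 = self.fc(torch.cat(feats, dim=1))             # related-product representation Prod_1
        return prod1

if __name__ == "__main__":
    cnn = MetadataCNN()
    meta = torch.randn(2, 50, 100)                           # 2 metadata texts, 50 words, GloVe-100
    print(cnn(meta).shape)                                   # (2, 128)
```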

3.3.2 Unique product information

Although reviews for the same type of product may share the same important attributes, the degree of importance of these attributes may vary from product to product. In order to represent the unique characteristics of each product, the unique product identifier for each product is mapped into a vector \(Prod_{2}\). At the outset, \(Prod_{2}\) is randomly initialized. During the training process, this vector is only updated when reviews specific to the product are used for training. Thus \(Prod_{2}\) can be interpreted as a high level representation of product-specific information. The final product representation Prod is generated by combining the two vectors: \(Prod_{1}\) and \(Prod_{2}\) as:

$$\begin{aligned} Prod = \tanh (W_{1}Prod_{1} + W_{2}Prod_{2} + b^{p}) , \end{aligned}$$
(14)

where \(W_1\) and \(W_2\) are weight matrices for \(Prod_1\) and \(Prod_2\) respectively, and \(b^{p}\) is the bias vector. We calculate the product attention weights based on the score function f(.), and the input to the score function is the product representation Prod and hidden state of a word \(h_{ij}^{p}\):

$$\begin{aligned} f(Prod, h_{ij}^p) = (v_{w}^p)^T\tanh (W_{wh}^{p}h_{ij}^{p} + W_{wp}^{p}Prod + b_{w}^{p}) , \end{aligned}$$
(15)

where \((v_{w}^{p})^T\) denotes the transpose of the weight vector \(v_{w}^{p}\), \(W_{wh}^{p}\) and \(W_{wp}^{p}\) are weight matrices, and \(b_{w}^{p}\) is the bias vector. Then we apply the softmax function to get a normalized attention score \(\alpha _{ij}^{p}\). At the word level, the sentence representation is defined in Eq. 16, where \(\alpha _{ij}^{p}\) indicates the product attention score of the word representation \(h_{ij}^{p}\). The representation of a review is obtained through Eq. 17, where \(\beta _{j}^{p}\) indicates the attention weight for the hidden state of the jth sentence \(h_j^{p}\).

$$\begin{aligned}&s_{j}^{p} = \sum _{i=1}^{l} \alpha _{ij}^{p} h_{ij}^{p} , \end{aligned}$$
(16)
$$\begin{aligned}&r^{p} = \sum _{j=1}^{m} \beta _{j}^{p} h_{j}^{p} , \end{aligned}$$
(17)

After applying the sentiment attention layer and the product attention layer separately, we obtain two different review representations \(r^{s}\) and \(r^{p}\). These two representations are concatenated as the final representation of a review \(r = [r^{s}, r^{p}]\). Then, we apply a fully connected layer on top of r, to classify the helpfulness of a review.
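
The following sketch ties these pieces together: it fuses \(Prod_{1}\) with the product-identifier embedding \(Prod_{2}\) (Eq. 14), applies the word-level product attention (Eqs. 15–16), and concatenates two review representations before a final classification layer. It assumes the components sketched earlier; names and dimensions are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn

class ProductAttention(nn.Module):
    """Product fusion (Eq. 14) and word-level product attention (Eqs. 15-16)."""
    def __init__(self, hidden_dim, prod_dim, num_products):
        super().__init__()
        self.prod2_embed = nn.Embedding(num_products, prod_dim)   # product identifier -> Prod_2
        self.W1 = nn.Linear(prod_dim, prod_dim, bias=False)       # for Prod_1
        self.W2 = nn.Linear(prod_dim, prod_dim)                   # for Prod_2, bias plays the role of b^p
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_p = nn.Linear(prod_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_words, prod1, prod_id):
        prod = torch.tanh(self.W1(prod1) + self.W2(self.prod2_embed(prod_id)))         # Eq. 14
        scores = self.v(torch.tanh(self.W_h(h_words) + self.W_p(prod).unsqueeze(1)))   # Eq. 15
        alpha = torch.softmax(scores, dim=1)
        return (alpha * h_words).sum(dim=1)                                            # Eq. 16

class HelpfulnessHead(nn.Module):
    """Concatenates r^s and r^p and predicts a helpfulness probability."""
    def __init__(self, review_dim):
        super().__init__()
        self.fc = nn.Linear(2 * review_dim, 1)

    def forward(self, r_s, r_p):
        r = torch.cat([r_s, r_p], dim=-1)        # r = [r^s, r^p]
        return torch.sigmoid(self.fc(r)).squeeze(-1)

if __name__ == "__main__":
    pa = ProductAttention(hidden_dim=256, prod_dim=128, num_products=1000)
    head = HelpfulnessHead(review_dim=256)
    # A one-sentence review for brevity, so the Eq. 16 output stands in for r^p here.
    h = torch.randn(2, 12, 256)
    r_p = pa(h, torch.randn(2, 128), torch.tensor([3, 7]))
    print(head(torch.randn(2, 256), r_p).shape)  # (2,)
```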

3.4 Loss function

To minimize the difference between the predicted helpfulness value and the actual helpfulness label, we utilize cross entropy loss as the objective function. It is a commonly used loss function for binary classification, and is defined as:

$$\begin{aligned} Loss_{task} = -\sum _{i=1}^{N} (y_{i}\log (p(y_{i})) + (1-y_{i})\log (1-p(y_{i}))) , \end{aligned}$$
(18)

where \(y_i\) indicates the actual helpfulness label, \(p(y_i)\) indicates the probability of helpfulness. N is the number of training observations. We present details on how these \(y_i\) are assigned in the following section.
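
Eq. 18 is the standard binary cross-entropy; a minimal sketch using PyTorch's built-in loss, assuming the model outputs a helpfulness probability per review:

```python
import torch
import torch.nn as nn

# Binary cross-entropy over predicted helpfulness probabilities p(y_i) and labels y_i (Eq. 18).
criterion = nn.BCELoss(reduction="sum")   # sum over the N training observations

p_y = torch.tensor([0.9, 0.2, 0.7])       # predicted helpfulness probabilities
y = torch.tensor([1.0, 0.0, 1.0])         # actual helpfulness labels
loss = criterion(p_y, y)
print(loss.item())
```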

4 Experiment and results

This section focuses on evaluating our architecture with respect to review helpfulness. Given a review, we want to determine whether or not it is helpful. We first compared our model with competing models from prior work on two data sets. Then, we evaluated the performance of different components of our architecture in two application scenarios: cold start and warm start. Correspondingly, we split the data into training and test data differently for the two scenarios. Last, we compared the performance of our proposed model in both the warm start and cold start scenarios.


Evaluation metric In this study we use the Receiver Operating Characteristic Area Under the Curve (ROC AUC) statistic to evaluate the performance of our proposed model. This is a standard statistic used in the machine learning community to compare models, and it is robust when imbalanced data sets are involved.
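
For reference, the AUC values reported in the remainder of this section can be computed from predicted helpfulness probabilities with scikit-learn; a toy example:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0]             # helpfulness labels
y_score = [0.9, 0.3, 0.6, 0.8, 0.4]  # predicted helpfulness probabilities
print(roc_auc_score(y_true, y_score))   # 1.0 for this toy example
```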

4.1 Data sets

We evaluate our model on two publicly available data sets. One data set originates from Amazon reviews and was released by Julian McAuley (He and McAuley 2016). The other data set is from the Yelp Dataset Challenge 2018 (Yelp 2018). We pre-process the data in the same way as Fan et al. (2019): First, we join the product review with corresponding metadata information. Second, we filter out the reviews that have no votes. Last, we label reviews that receive more than 75% helpful votes out of total votes as helpful, and label the remaining reviews as unhelpful.

We chose the same threshold of 75% as Fan et al. (2019), in order to provide a fair comparison with their reported model performance. Moreover, the threshold of 75% makes more sense than a lower threshold such as 50%. Analysis of the data set shows that more than 80% of the reviews achieve a helpfulness vote ratio greater than 50%, whereas only around 60% of the reviews achieve a helpfulness vote ratio of more than 75%. If we chose a threshold of 50%, the problem would become much easier, as only the clearly unhelpful reviews would be labelled negative. More importantly, many of the reviews labelled positive under such a threshold would not be the ones we are after: we want to identify only the most helpful reviews, so that users do not have to read all of the reviews that would be labelled positive with a lower threshold.
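
A minimal pandas sketch of the three preprocessing steps (join with metadata, drop unvoted reviews, apply the 75% threshold); the column names are illustrative assumptions, not the actual schema of the released data sets.

```python
import pandas as pd

def preprocess(reviews: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Join reviews with product metadata, drop unvoted reviews, label helpfulness."""
    df = reviews.merge(metadata, on="product_id", how="inner")      # step 1: attach metadata
    df = df[df["total_votes"] > 0].copy()                           # step 2: drop reviews with no votes
    ratio = df["helpful_votes"] / df["total_votes"]
    df["helpful"] = (ratio > 0.75).astype(int)                      # step 3: 75% threshold
    return df

reviews = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],
    "helpful_votes": [8, 1, 0],
    "total_votes": [10, 4, 0],
})
metadata = pd.DataFrame({"product_id": ["p1", "p2"], "title": ["Mouse", "Keyboard"]})
print(preprocess(reviews, metadata)[["product_id", "helpful"]])
```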

4.1.1 Data set partition for cold start scenario

In practice, a new product may not yet have received any helpful votes. Therefore, assessment standards cannot be learned from past voting information, which leads to the cold start problem. To evaluate model performance in this scenario, we randomly select 80% of the products and their corresponding reviews as the training data set for each product category in both data sets. The remaining products and their reviews are employed as the test data set. The statistics of the two data sets are summarized in Tables 1 and 2. All of the reviews for a given product appear only in the training data set or only in the test data set; consequently, all products in this test data set face the cold start problem. Even though the partitioning approach is the same as that reported by Fan et al. (2019), a consequence of the random assignment of products to the test and training data sets is that the actual number of reviews differs from that of Fan et al. (2019). However, the difference is less than 1%, which is not statistically significant.
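
A minimal sketch of the product-level split used in this scenario, in which each product and all of its reviews fall entirely into either the training or the test set; function and column names are illustrative. The warm start partition in Sect. 4.1.2 instead samples reviews directly.

```python
import numpy as np
import pandas as pd

def cold_start_split(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 0):
    """Split by product so that no product appears in both the training and test sets."""
    rng = np.random.default_rng(seed)
    products = df["product_id"].unique()
    rng.shuffle(products)
    n_train = int(train_frac * len(products))
    train_products = set(products[:n_train])
    train = df[df["product_id"].isin(train_products)]
    test = df[~df["product_id"].isin(train_products)]
    return train, test

df = pd.DataFrame({"product_id": ["p1", "p1", "p2", "p3", "p3"], "helpful": [1, 0, 1, 0, 1]})
train, test = cold_start_split(df)
print(len(train), len(test))
```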

Table 1 Statistics of Amazon data set in cold start scenario
Table 2 Statistics of Yelp data set in cold start scenario

4.1.2 Data set partition for warm start scenario

In the warm start scenario, some reviews of a product already have user votes, while other reviews have not yet received any. In this scenario, we evaluated the different components of the proposed HSAPA model. Moreover, we compared the performance of HSAPA in the warm start scenario with that in the cold start scenario, and verified that our proposed model achieves better performance in the warm start scenario, where unique product information can be captured.

For this scenario, we randomly select 80% of the reviews as the training data, and use the remaining reviews as the test data. The data statistics are shown in Tables 3 and 4. As 80% of the reviews for products are in the training data set, this partitioning produces a warm start scenario. As in the cold start scenario, we randomly selected 10% of the reviews from the training set as a validation data set, and performed a grid search of the hyper-parameter space on the validation data set to determine the best choice of hyper-parameters. The model for each category, with its hyper-parameters fixed, is then trained on the entire training data set.

Table 3 Statistics of Amazon data set in warm start scenario
Table 4 Statistics of Yelp data set in warm start scenario

4.2 Model comparison

4.2.1 Competing models

We compare our proposed model with several baseline models in the cold start scenario. Two of the models, Fusion (SVM) and Fusion (R.F.), rely on hand-crafted features. These hand-crafted features are structural features (STR), emotional features (GALC), lexical features (LEX) and semantic features (INQUIRER), which were described earlier in Sect. 2. The baseline models against which we compare our model are:

  • Fusion (SVM) uses a Support Vector Machine to fuse features from the preceding feature list.

  • Fusion (R.F.) uses a Random Forest to fuse features from the preceding feature list.

  • Embedding-Gated CNN (EG-CNN) (Chen et al. 2019) introduces a word-level gating mechanism that weights word embeddings to represent the relative importance of each word.

  • Multi-task Neural Learning (MTNL) (Fan et al. 2018) is based on a multi-task neural learning architecture with a secondary task that tries to predict the star ratings of reviews.

  • Product-aware Review Helpfulness Net (PRH-Net) (Fan et al. 2019) is a neural network-based model that introduces target product information to enhance the representation of a review. Fan et al. evaluate this model on the two data sets we are using and claim that PRH-Net is the state of the art.

The source code of the models listed above is not available. In their paper, Fan et al. (2019) implemented these models (Fusion SVM, Fusion R.F., EG-CNN, and MTNL) and reported a comparison of the results on two data sets with their own model (PRH-Net). These two data sets are publicly available (He and McAuley 2016; Yelp 2018). Therefore, we conducted experiments on the same data sets and compared the performance of our model with the results reported in Fan et al. (2019).

4.2.2 Training settings

The training is based on the data sets in Tables 1 and 2. We use the same data sets and same partition approach as Fan et al. (2019). This allows us to directly compare the performance of our model with the results reported by Fan et al. (2019). We randomly select 10% of the products and their corresponding reviews from the training set as a validation data set. We then performed a grid search of hyper-parameter space on the validation data set to determine the best choice of hyper-parameters. These hyper-parameters include number of hidden units for each LSTM cell, embedding dimension for each word, learning rate, number of epochs and so on. These hyper-parameters were optimized on a per category basis. The models were then trained based on the entire training data set with these fixed hyper-parameters.

Table 5 Review helpfulness prediction of Amazon data set
Table 6 Review helpfulness prediction of Yelp data set

4.2.3 Results and findings

Tables 5 and 6 show the results on the Amazon and Yelp data sets, respectively. In Table 5 we see that our model outperforms previous models on all categories of the Amazon data set, with an average improvement in AUC of 5.4% over the next best model. The degree of improvement varies from category to category. In category AC3 (Electronics), our model achieves an improvement of 7.9%. In contrast, for category AC4 (Grocery and Gourmet Food), the improvement is only 0.3%. We note that category AC4 has less data than most of the other categories (Table 1); only category AC8 (Pet Supplies) contains fewer products and reviews, and there are proportionally more reviews per product for AC8 than for AC4. We suspect that the sentiment and product embeddings may not be learned well with such limited and divergent data, so the improvement is not as high as for the other categories. The results for the Yelp data set are presented in Table 6. Our model also outperforms the previous models in all categories, with an average improvement in AUC of 1.5% over the next best model. The overall improvement is not as high as on the Amazon data set, which may be due to the relatively small number of products and reviews in the Yelp data set: with the exception of category YC4 (Restaurants), the other categories have fewer products and reviews than all of the categories of the Amazon data set. The comparison results presented in Tables 5 and 6 show that our model outperforms the baseline models in the cold start scenario.

Table 7 t test results for HSAPA and PRH-Net on Amazon data set
Table 8 t test results for HSAPA and PRH-Net on Yelp data set

4.2.4 Significance test

We further evaluate the significance of the improvement of our proposed model by conducting a one-tailed t test. We ran each model 20 times, and the number of degrees of freedom is 19.

We compared the results of our model (HSAPA) with the state-of-the-art model (PRH-Net) in the cold start scenario. The t test results for the Amazon and Yelp data sets are shown in Tables 7 and 8, respectively. The second column of these tables shows the average accuracy and standard deviation of our model for each category. The last two columns show the calculated t value and the corresponding p value, respectively. In the case of the Amazon data set (Table 7), the statistical results demonstrate that our method is significantly better than the state-of-the-art model with a p value of 0.0005 for all categories except category AC4. In contrast to the other categories, category AC4 is a relatively small data set (437,253 reviews) representing a relatively large number of products (96,320). These results suggest that the model we propose performs better on larger data sets. Even in the case of category AC4, our model still achieves results comparable to the state-of-the-art model. In the case of the Yelp data set (Table 8) the improvement in performance is not as uniformly statistically significant. This may be a consequence of the relatively small size of the training data set. Nonetheless, the results are still statistically significant with \(p \le 0.01\) for categories YC1, YC2, and YC5.
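
The test can be reproduced from per-run AUC values with scipy; the sketch below uses synthetic values and assumes the 20 runs of the two models are paired, which matches the stated 19 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-run AUC values for 20 runs of each model (illustrative only).
auc_hsapa = rng.normal(0.72, 0.004, size=20)
auc_prhnet = rng.normal(0.68, 0.004, size=20)

# A paired t test on 20 runs has 19 degrees of freedom, matching the text.
t_stat, p_two_sided = stats.ttest_rel(auc_hsapa, auc_prhnet)
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, one-tailed p = {p_one_tailed:.4g}")
```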

Table 9 Performance of our models with different components

4.3 Evaluating different components of HSAPA

In order to tease out the performance contribution of each component of our model, we evaluated different configurations of the HSAPA model. Table 9 reports the average results across all categories of the Amazon and Yelp data sets for these configurations. Here HBiLSTM refers to the hierarchical bi-directional LSTM model without either of the attention layers; we use it as the baseline for comparison. HSA refers to the combination of HBiLSTM with the sentiment attention layer, HPA refers to the combination of HBiLSTM with the product attention layer, and HSAPA refers to the complete model with both attention layers. For this evaluation, we test the above models in both the cold start and warm start scenarios. The models are trained on the data sets of the respective scenarios, and the hyper-parameters are tuned to achieve the best AUC for each model.

In the cold start scenario, from Table 9, we see that adding a sentiment attention layer (HSA) to the base model (HBiLSTM) results in an average improvement in the AUC score of 2.0% and 2.6% on the Amazon and Yelp data sets, respectively. By adding a product attention layer (HPA) to the base model (HBiLSTM), the improvement is 0.7% and 1.3% on the Amazon and Yelp data sets, respectively. Combining all three components results in an even larger increase in the AUC score: 3.4% and 4.8% on the Amazon and Yelp data sets, respectively. We note that on both data sets, the improvement from the product attention layer is lower than that from the sentiment attention layer. This may be because in the cold start scenario we have no information about the target product, so the helpful attributes learned from related products may not be sufficiently accurate.

In the warm start scenario, we also evaluated the contribution of each attention layer and of the combination of the two attention layers of the proposed HSAPA model. From Table 9, we see that the addition of the sentiment attention layer (HSA) to the base model increases the AUC by 1.8% and 2.7% on the Yelp and Amazon data sets, respectively. The addition of the product attention layer (HPA) to the base model increases the AUC by 3.8% and 5.5% on the Yelp and Amazon data sets, respectively.

In summary, we observe a synergistic effect resulting from the addition of the two attention layers in both scenarios. Comparing the two scenarios in Table 9, we have two additional observations. First, the average performance of the base model (HBiLSTM) is very similar in both scenarios. Second, in the warm start scenario, adding the product attention layer (HPA) leads to higher improvements than adding the sentiment attention layer (HSA) on both data sets. The opposite holds in the cold start scenario, where adding the product attention layer (HPA) alone results in less of an improvement than adding the sentiment attention layer (HSA).

4.4 Performance comparison of HSAPA in two scenarios

Table 10 Performance of HSAPA on Amazon data set in the cold start and warm start scenarios
Table 11 Performance of HSAPA on Yelp data set in the cold start and warm start scenarios

We further compared the performance of our proposed model HSAPA in the two scenarios: warm start and cold start. Tables 10 and 11 show the results of HSAPA on each category of the Amazon and Yelp data sets for the two scenarios. We see that HSAPA in the warm start scenario outperforms HSAPA in the cold start scenario on most categories of the two data sets. In the cold start scenario, the product embedding can only be learned from reviews of related products. In contrast, in the warm start scenario product information can be learned from both the target product and related products. This explains why the HSAPA model achieves better performance in the warm start scenario. In practice, one can expect a mix of cold start and warm start cases, in which HSAPA should perform better than in the pure cold start scenario.

Table 12 The t test results of HSAPA on Amazon data set for two scenarios
Table 13 t test results of HSAPA on Yelp data set for two scenarios

We evaluated the significance of the performance difference of our proposed model in the warm start and cold start scenarios by conducting a one-tailed t test. The t test results for the Amazon and Yelp data sets are shown in Tables 12 and 13, respectively. We find that the model in the warm start scenario achieves a significant improvement (\(p < 0.005\)) compared to that in the cold start scenario for most categories in both data sets. We also observe that, for Electronics in the Amazon data set, the performance is not significantly different. We suspect this is because the information captured from related products is sufficient to identify helpful information from a review text. For Home Services and Restaurants in the Yelp data set, we do not see a significant difference between the two scenarios. For these two categories, the significance result is consistent with the result shown in Table 8, and may be a consequence of the relatively small size of the training data set. In sum, the results for most categories demonstrate that the introduction of the product attention layer is able to capture unique product information in the warm start scenario and improve the accuracy of our model.

4.5 AUC gain

We observed that for both the cold start and warm start scenarios (see Table 9), adding a sentiment attention layer (HSA) and a product attention layer (HPA) to the base model (HBiLSTM) improves the AUC score on both the Amazon and Yelp data sets. We conducted an experiment, for both scenarios, to verify that the gain in AUC is a consequence of the additional attention layers and not simply a result of adding more parameters.

We adjusted the hyper-parameters of the HBiLSTM, HSA and HPA models to ensure they have approximately the same number of parameters as the complete model HSAPA. For example, for the category Grocery in the Amazon data set, the number of parameters of the complete HSAPA model in the cold start scenario is 30,194,490. We increased the number of hidden units in the other three models to create new models with approximately the same number of parameters (HBiLSTM: 30,420,604; HSA: 30,424,204; HPA: 30,412,858). Recall that the hyper-parameters were determined by a grid search of the hyper-parameter space. Not surprisingly, the new models with more parameters do not demonstrate an improvement in performance compared to the models with hyper-parameters determined by grid search. Our proposed model demonstrates improved performance not simply because of greater modelling power due to more parameters, but because the sentiment and product attention layers leverage sentiment and product related information.
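
Matching parameter counts across model variants can be checked directly; a minimal sketch, assuming PyTorch modules as illustrative stand-ins for the variants:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Illustrative stand-ins: increasing the hidden size increases the parameter count.
small = nn.LSTM(100, 128, bidirectional=True)
large = nn.LSTM(100, 256, bidirectional=True)
print(count_parameters(small), count_parameters(large))
```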

Table 14 Highlighted words by sentiment and product attention scores in three review examples

5 Analysis

5.1 Visualization of attention layers

We visually examine the word-level attention scores of both attention layers on three randomly sampled review examples (shown in Table 14). We use two colors, red and green, to represent the sentiment attention scores and the product attention scores, respectively. The lightness/darkness of the color is proportional to the magnitude of the attention score. There are a few interesting patterns to note. First, for the sentiment attention layer, the words that are assigned large weights carry a sentiment close to the overall sentiment of the review. For instance, in example 2 the overall sentiment of the review is positive (5 out of 5). Although there are several negative words such as “no” and “don’t”, positive words/phrases like “great” and “remain sharp” are still assigned higher attention weights. In the third example, the word “Unfortunately” is assigned more weight than the word “well”. This observation is consistent with our hypothesis that the importance of a word in a review can be affected by the review sentiment. Second, product attributes and the words describing them gain higher weights from the product attention layer. For instance, in the first example the descriptive words “fit” and “enough” and the noun “tub” are assigned relatively high attention scores. Third, the combination of the important words captured by the two attention layers gives a brief yet thorough summary of a review. It may also visually explain why the combination of the two layers achieves a better result than a single attention layer.

Fig. 2 The average error rate for each star rating on the Amazon data set

5.2 Error analysis

In this section we present our analysis of misclassified reviews. First, from the perspective of sentiment, we calculated the average error rate for each star rating across all categories of the Amazon data set. The average error rate for each star rating is shown in Fig. 2. We found relatively high error rates for reviews with star ratings of 3 and 4. We interpret this to mean that these star ratings convey a relatively neutral sentiment, so the attention weights for words/sentences in such reviews may not differ much. Consider the following example with a star rating of 3:

The product its self is great, it works wonderful, but getting to its original state is the tricky part not hard but time consuming!! It does its job and I got for a great price!

We analyzed the sentiment attention scores and found that the weights are distributed almost equally across the words in this review, for example over the words ’great’ and ’tricky’. Therefore, sentiment attention may not help much in this case.

We then calculated the average error rate for each product. We observed that products with more reviews seem to have smaller error rates. However, this finding is not consistent for all categories. It holds for some categories such as Home, Health and Tools, but fails for the Pet category. In the case of the Pet category, the error rate does not significantly decrease for products with large numbers of reviews.

Fig. 3 Histogram of different pairs of star rating combinations for categories Home, Clothing, and Grocery from the Amazon data set

Fig. 4 Histogram of different pairs of star rating combinations for categories Books, Electronics, and Movies from the Amazon data set

Fig. 5 Histogram of different pairs of star rating combinations for categories Health, Tools, and Pet from the Amazon data set

Table 15 The AUC for Amazon reviews with different star rating distributions

5.3 Impact of sentiment on model performance

In this study, we make use of the star rating of each review to represent sentiment. In this section, we investigate the effect of different star rating combinations on model performance. We hypothesize that if the combinations of star ratings represent very distinct sentiment categories, using sentiment derived from star ratings will help the model achieve better performance. The distributions of the different combinations of pairs of star ratings for the nine Amazon product categories are shown in Figs. 3, 4 and 5. These histograms depict the number of reviews for each of the possible pair combinations (5 choose 2) of star ratings for each product category. From these figures, we find that there are more reviews with a star rating of 5. Thus, there are more reviews for pair combinations involving 5, i.e., 1–5, 2–5, 3–5, and 4–5. For other combinations, the numbers of reviews are similar, but noticeably smaller than for the combinations involving 5.

We used the HSAPA model trained in the warm start scenario to analyze the effect of sentiment on model performance; in the warm start scenario, we can avoid the possible effect of product attributes on model performance. The model is the same as the one introduced in Sect. 4.4 (results shown in Table 10). The hyper-parameters (such as the number of units of the Bi-LSTM module) are tuned to achieve the best AUC. This model was trained on reviews representing all star rating categories, i.e., one to five stars. We then analyzed the results by focusing on subsets of reviews corresponding to pairs of star ratings. For example, we considered the results for one star versus two stars, one star versus three stars, etc. We calculated the corresponding AUC value to examine the performance on different star rating combinations. If the two star ratings of a combination are close, such as three and four, we consider the combination to have small variance in sentiment. In contrast, if the combination pairs star ratings of one and five, we consider it to have large variance in sentiment, i.e., very distinct sentiment categories.

Table 15 shows the results of this experiment. The last column shows the AUC for reviews with star ratings ranging from one to five, i.e., all star ratings are represented. The preceding columns present the AUC values computed on the predicted probabilities of reviews that contain only two star rating categories. For example, column “1–2” refers to the AUC for reviews with one or two stars. The results show that, on average, the combinations “1–4” and “1–5” exhibit the best performance. These two combinations represent distributions with large variance in sentiment. In contrast, the columns labeled “4–5”, “2–3”, and “1–2” show the AUC values resulting from combinations of star ratings with small variance in sentiment. The results validate our hypothesis that the star ratings of reviews affect model performance, and that more divergent star rating distributions lead to better AUC values. Our model performs better on reviews with very different star ratings. Intuitively, each product has an average rating; if a review expresses a sentiment that is inconsistent with the average rating, it may be considered unhelpful. This finding is consistent with previous research (Hong et al. 2012), in which the difference between the current star rating and the average rating was used as a feature and found to improve performance.
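
A sketch of this pairwise analysis: predictions from the model trained on all reviews are restricted to reviews with a given pair of star ratings, and the AUC is recomputed on each subset. Column names are illustrative.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import roc_auc_score

def pairwise_auc(df: pd.DataFrame) -> dict:
    """AUC on subsets of reviews restricted to each pair of star ratings (5 choose 2 pairs)."""
    results = {}
    for a, b in combinations(range(1, 6), 2):
        subset = df[df["stars"].isin([a, b])]
        # Skip degenerate subsets where only one helpfulness class is present.
        if subset["helpful"].nunique() == 2:
            results[f"{a}-{b}"] = roc_auc_score(subset["helpful"], subset["pred_prob"])
    return results

df = pd.DataFrame({
    "stars": [1, 1, 3, 4, 5, 5, 2, 4],
    "helpful": [0, 1, 0, 1, 1, 0, 0, 1],
    "pred_prob": [0.2, 0.7, 0.3, 0.8, 0.9, 0.4, 0.1, 0.6],
})
print(pairwise_auc(df))
```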

6 Recommender system

The model we propose can also be utilized for recommendation purposes. For each product, in addition to evaluating the helpfulness of each review, we can also recommend the most helpful reviews for customers. We evaluate the performance of our model as a recommender system by comparing our results with those reported by Fan et al. (2019) for PRH-net.

6.1 Evaluation metrics

To evaluate the performance of our model for the recommendation problem, we use three commonly used metrics: NDCG@n, Precision@n and Recall@n. Normalized Discounted Cumulative Gain (NDCG) is widely used to measure the quality and relevance of search algorithms in information retrieval. Here we apply it to evaluate the effectiveness of review ranking systems. It is computed as follows.

$$\begin{aligned} \text {NDCG}@n = \frac{{\text {DCG}}}{{\text {iDCG}}} = \frac{\sum _{i=1}^{n} \frac{2^{r(u_i)} - 1}{\log (1+i)}}{\text {iDCG}}, \end{aligned}$$
(19)

where n is the number of reviews in the ranking list, i indicates the rank position of review \(u_i\), \(r(u_i) \in \{0,1 \}\) denotes the helpfulness of \(u_i\) (1: helpful, 0: unhelpful), and iDCG is the DCG value computed based on the ideal ranking order of the same set of reviews. In our analysis, we chose n to be 10 for all three metrics.
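
A minimal implementation of Eq. 19 with binary relevance, assuming reviews are ranked by their predicted helpfulness probability; base-2 logarithms are assumed, as the equation does not state the log base.

```python
import numpy as np

def ndcg_at_n(labels, scores, n=10):
    """NDCG@n for binary helpfulness labels (Eq. 19), ranking reviews by predicted score."""
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-np.asarray(scores))              # rank reviews by predicted helpfulness
    ranked = labels[order][:n]
    ideal = np.sort(labels)[::-1][:n]                    # ideal ordering puts helpful reviews first
    discounts = np.log2(np.arange(2, len(ranked) + 2))   # log(1 + i), base 2 assumed
    dcg = np.sum((2 ** ranked - 1) / discounts)
    idcg = np.sum((2 ** ideal - 1) / discounts[: len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0

labels = [1, 0, 1, 1, 0, 1]                              # 1: helpful, 0: unhelpful
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]                  # predicted helpfulness probabilities
print(round(ndcg_at_n(labels, scores, n=10), 3))
```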

6.2 Competing model

In Sect. 4, we showed that our proposed HSAPA model outperforms all previous models at the task of identifying whether a review is helpful. Among those models, PRH-Net achieves the best performance. Therefore, we compared the results of our model with those of the PRH-Net model on the task of recommending the top n reviews of each product. We implemented the PRH-Net model based on the description in Fan et al. (2019), and tuned its hyper-parameters to achieve the best AUC performance. Based on the predicted helpfulness of each review, we calculated the values of the three metrics.

Table 16 Comparison of HSAPA and PRH-Net models based on NDCG@10, Precision@10 and Recall@10 on each Amazon data category

6.3 Results

Table 16 lists the results of the three metrics for each category of the Amazon data set for the HSAPA model and the PRH-Net model. From the table, we observe that, on average, our model outperforms PRH-Net on each of the three metrics: NDCG@10, Precision@10 and Recall@10. Our proposed model achieves a precision of 89.8%, which means that in identifying the top 10 product reviews, our model is correct 89.8% of the time on this data set. The metric NDCG@10 gives a more precise measure based on the positions of the recommended reviews; in terms of correctness of review position, our model also outperforms PRH-Net. Looking more closely at each category, we find that our model achieves better NDCG@10 and Precision@10 results across all categories. In contrast, the Recall@10 results are mixed: the HSAPA model and the PRH-Net model each perform better on different categories. While it is usually desirable to strike a balance between precision and recall, it is often not possible to attain equally high values for both. Our model achieves better Precision@10 results than Recall@10 results. The Precision@10 and NDCG@10 results indicate that most of the top 10 reviews that our model recommends for each product are of high quality and deemed to be helpful.

7 Conclusion

In this paper, we describe our analysis of review helpfulness prediction and propose a novel neural network model with attention modules that incorporate sentiment and product information. We also describe the results of our experiments in two application scenarios: cold start and warm start. In the cold start scenario, our results show that the proposed model outperforms PRH-Net, the previous state-of-the-art model, with an increase in AUC of 5.4% and 1.5% on the Amazon and Yelp data sets, respectively. Furthermore, we evaluate the effect of each attention layer of the proposed model in both scenarios, and find that both attention layers contribute to the improvement in performance. In the warm start scenario, the product attention layer attains better performance than in the cold start scenario since it has access to reviews of the target products. We also evaluate our model from the perspective of recommender systems with three commonly used metrics: NDCG@10, Precision@10 and Recall@10. Based on these results, our model outperforms the state-of-the-art model developed by Fan et al. (2019).

Our proposed HSAPA model is able to identify helpful information in a review text based on review star ratings and product metadata such as product descriptions. The HSAPA model not only identifies helpful reviews, but also recommends the top n helpful reviews for each product, which is quite useful when a product has a large number of reviews. In this paper, we evaluate review helpfulness from the perspective of review quality. In the future, we may rank the helpfulness of reviews by incorporating a user’s own preferences (Qu et al. 2019) in order to make personalized recommendations (Devlin et al. 2018).