1 Introduction

In e-commerce, Web 2.0 provides platforms for internet users to share their knowledge, expertise and experiences on forums, review portals, blogs and other social media websites (Bertola and Patti 2016). It also enables consumers to share their opinions and experiences about services and product usage on review platforms such as TripAdvisor, Yahoo product reviews and Yelp (Krishnamoorthy 2015). Product reviews are an essential part of both traditional and electronic commerce. According to a survey, Google Scholar shows 15,600 and 13,200 daily hits for ‘‘product reviews” and “online reviews’’, respectively (Anderson and Magruder 2012). Reading product reviews has become a common step because such reviews contain important information and facilitate customers in making purchase decisions (Chen and Xie 2008). However, most popular products receive large collections of reviews, which results in an information overload problem (Liu et al. 2008).

Several studies have described how product sales are affected by reviews and related factors of particular product categories under certain conditions (Duan et al. 2008; Forman et al. 2008). Recent studies highlighted that reviewer and review characteristics such as information quantity, semantic factors, and reviewer location and identity have opened new dimensions in this line of research (Cao et al. 2011). Moreover, Mudambi and Schuff (2010) suggested that future research will focus on new dimensions of reviewer status, such as the “top reviewer” designation on Amazon.com. Similarly, review helpfulness is an important characteristic associated with online products.

Some e-commerce websites provide consumers with an interactive voting facility; Amazon.com, for example, asks its readers “Was this review helpful to you? Yes/No”. Reviews with more helpful votes are ranked higher than those with fewer votes. However, with the exponential growth of reviews on websites, reviews are not always consistently helpful; e.g., sentiments expressed in reviews can have varied effects on helpfulness (Forman et al. 2008). Sentiments may be categorized as mixed, favorable or unfavorable. Some users believe that favorable and unfavorable reviews are helpful, while others assume that mixed reviews are helpful (Crowley and Hoyer 1994). Variations in review helpfulness also exist across product types, i.e., experience and search goods. The quality of experience products is difficult to verify before use, while search products can be judged on the basis of product specifications before purchase (Mudambi and Schuff 2010; Willemsen et al. 2011). Therefore, consumers looking for experience products rely specifically on others’ usage experiences.

Usually online products have thousands of reviews, and it is very difficult for customers to read every review. Helpful reviews facilitate buyers by providing significant feedback and the experiences of other buyers about the product. Other benefits are: (1) reviews can be effectively summarized by filtering out low-quality reviews; (2) websites that do not use a voting feature could benefit from an automated helpfulness prediction system; (3) review ranking systems can be improved by using influential features of review helpfulness. Review helpfulness is computed as the ratio of the number of helpful votes to the total number of votes attracted by a review. This ratio is referred to as the helpfulness ratio. However, review helpfulness is a multi-faceted concept that can be driven by several types of factors (both quantitative and qualitative). Here, multi-faceted means “having many different aspects or features”. The most common practice for determining the helpfulness of online reviews in early studies was to utilize basic features of reviews such as review length, star rating and thumbs up/down (Pang et al. 2002; Otterbacher 2009). More recent studies have focused on qualitative measures in addition to quantitative ones, such as linguistic features, reviewer impact, reviewer experience and reviewer cumulative helpfulness (Mudambi and Schuff 2010; Huang et al. 2015; Krishnamoorthy 2015). Similarly, qualitative aspects of reviews such as the review type itself (including regular, comparative and suggestive reviews, and cumulative helpfulness) have also been explored (Qazi et al. 2016).
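For concreteness, the helpfulness ratio defined above is a simple quotient over a review's vote counts; a minimal sketch (illustrative only, not code from the study):

```python
def helpfulness_ratio(helpful_votes, total_votes):
    """Helpfulness ratio: helpful votes divided by total votes for a review."""
    return helpful_votes / total_votes
```

For example, a review judged helpful by 8 of its 10 voters has a helpfulness ratio of 0.8.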

1.1 Motivations and contributions

Regarding ML models, prior studies mostly used basic, and occasionally popular, ML models to construct effective predictive models for review helpfulness. The approaches include: ordinal logistic regression (Cao et al. 2011), ordinary least squares regression with log transformation (Ghose and Ipeirotis 2011), random forest (Krishnamoorthy 2015), logistic regression (Huang et al. 2015), multilayer perceptron neural network (Lee and Choeh 2014) and support vector regression (Zhang 2008).

However, there is a persistent demand for more robust ML algorithms with stronger predictive performance. The ensemble model is one of the most robust and widely used approaches: it is adaptable, works in the majority of cases, and has been shown to improve accuracy and produce highly accurate results. The use of an ensemble method can provide strong predictive power, which may lead to the selection of more helpful reviews for a given product. However, the selection of constituent models is hard to master and time-consuming.

Prior studies investigated sentiments (Chua and Banerjee 2016), review effectiveness (Wu 2017), discrete emotions (Malik and Hussain 2017), order effect (Zhou and Guo 2017), cognitive script (Ngo-Ye et al. 2017), and linguistic and psychological (Malik and Hussain 2018) characteristics for review helpfulness. The majority of previous studies utilized a limited number of product, reviewer and/or review category features. It has been observed that there are a number of significant review- and reviewer-type features that are not part of the state-of-the-art techniques presented in the literature. This research addresses the following research questions:

  • Which machine learning model demonstrates the most promising performance, so that an effective review helpfulness prediction model can be constructed?

  • Which type of features (review/reviewer/product) delivers the maximum contribution to review helpfulness prediction?

  • Does a set of variables exist that is most influential and has a strong relationship to review helpfulness?

The main objective of this study is to investigate the relationship between product, reviewer and review type features and review helpfulness. Seventeen features related to the product, reviewer and review categories are introduced. One goal is to examine the contribution of the proposed product, reviewer and review type indicators to review helpfulness as a standalone model. Another objective is to compare the impact of the hybrid set of features (review, reviewer and product) with a state-of-the-art baseline. A further aim is to identify a robust machine learning model that outperforms existing approaches, and to examine the importance of all product, reviewer and review indicators. A new dataset is prepared for experimentation by crawling reviews from 34 popular product categories of Amazon.com. Five popular ML methods and five evaluation metrics are used, and an effective helpfulness prediction model is built based on the popular Ensemble method. Theoretically, the results of the current research contribute to the relevant literature by providing further understanding of quantitative reviewer and review features and their influence on review helpfulness. More specifically, the study takes a step further to uncover the importance of each feature category, shedding light on the empirical relationship between these variables and review helpfulness. Additionally, the findings of the study extend the results found in the existing literature (Mudambi and Schuff 2010; Lee and Choeh 2014) by also looking at the reviewer and textual aspects of online reviews. The major contributions of the proposed study include:

  1. We are the first to use the Ensemble method to build a robust review helpfulness predictive model.

  2. Seven review, four reviewer and six product type features are introduced. In contrast to prior studies, which mostly utilized reviewer and review indicators, this research proposes all three types of indicators.

  3. A new review dataset is prepared for experimentation by crawling reviews from 34 popular product categories of Amazon.

  4. Findings reveal that the number of comments, sentiment and polarity of the review text; reviewer activity length and reviewer recency; and the number of questions answered, ratio of positive reviews and average rating per review of the product are the most effective predictors of review helpfulness.

  5. This research facilitates e-commerce retailers and managers in minimizing processing costs through better organization of their product reviews.

The rest of the paper is organized as follows: Sect. 2 presents the related work, followed by Sect. 3, which describes the proposed features used in the model. Subsequently, Sect. 4 presents detailed experimental analysis, results and discussion. Section 5 highlights the implications of this research, and Sect. 6 provides concluding remarks and outlines directions for future research.

2 Literature review

Kim et al. (2006) used semantic, structural, meta-data and lexical features to predict the degree of helpfulness. They found that review sentiment (valence), review length and its unigrams deliver the best results. Later, Hong et al. (2012) proposed a model that uses information reliability, sentiment measures and need-fulfillment features for the review classification problem. They used an SVM-based method for classification and demonstrated better results than previous studies (Kim et al. 2006; Liu et al. 2007). Quaschning et al. (2015) investigated the impact of valence consistency on review helpfulness using a multilevel regression method. They examined the influence of nearby reviews and their impact on the helpfulness ratio, in contrast to prior studies that utilized only individual reviews. The results demonstrated that consistent reviews are more helpful than inconsistent ones. The impact of emotional content in online reviews on perceived helpfulness was investigated by Ullah et al. (2015), who found that positive emotional content has a positive effect whereas negative emotional content has no effect on review helpfulness. More recently, Ullah et al. (2016) applied NLP techniques to analyze the emotional content of product reviews, exploring differences in emotional content across search and experience goods.

Another study was conducted by Liu et al. (2007) to detect low-quality reviews. Reviews were classified into low and high quality on the basis of subjectivity, readability and informativeness features. Tsur and Rappoport (2009) formulated an unsupervised technique for helpfulness-based ranking of online book reviews. The authors developed a lexicon of dominant terms and a virtual core review; the overall helpfulness ranking is built using the similarity of each review with the virtual core review. Hu and Chen (2016) analyzed the influence of review visibility and the interaction between hotel stars and review ratings on hotel review helpfulness using a model tree (M5P). They concluded that an interaction effect exists between hotel stars and review ratings, and that review visibility has a strong effect on review helpfulness. A Berlo-communication-model-based index system was designed by Xiang and Sun (2016) to analyze the impact of multi-typed factors on review helpfulness. Similarly, the influence mechanisms of the reviewer, the review and existing votes on review helpfulness were explored by Chen et al. (2015); the authors proposed three hypotheses, and the results indicate that review helpfulness has significant correlations with reviewers, review valence and review votes. Gao et al. (2016) investigated the consistency and predictability of reviewers’ rating patterns over time, concluding that reviewers’ rating behavior is consistent over time and across products, and that reviews with higher absolute rating bias in the past receive more helpful votes in the future.

A regression model was built by Zhang and Varadarajan (2006) to predict the utility of product reviews using lexical subjectivity and similarity along with POS-based terms as features. Later, a linear regression model was presented by Mudambi and Schuff (2010) to determine the influential factors for the helpfulness of product reviews; their work was replicated by Huang and Yen (2013) and attained only 15% explanatory power. Singh et al. (2017) employed machine learning techniques to construct predictive models for product review helpfulness, utilizing several textual features such as entropy, subjectivity, polarity and reading ease. Ngo-Ye et al. (2017) examined the utility of script analysis for predicting the helpfulness of customer reviews using a text regression model; the results show that the proposed model delivers high accuracy with a smaller feature subset and low training and testing time. Chen (2016) developed a conceptual model to investigate how review sidedness affects online review helpfulness, finding that two-sided reviews are more helpful than one-sided reviews when reviewers are experts in writing reviews for search goods, whereas the opposite holds for experience goods. Other studies that utilize regression models to explore significant textual and non-textual features include (Cao et al. 2011; Ghose and Ipeirotis 2011; Pan and Zhang 2011; Korfiatis et al. 2012; Chua and Banerjee 2015).

A multilayer perceptron neural network model was built by Lee and Choeh (2014) and trained using product, reviewer and review features; the authors demonstrated that the neural network method outperforms linear regression for helpfulness prediction. Agnihotri and Bhattacharya (2016) examined the qualitative factors of review text and their impact on review helpfulness, concluding that the moderating role of reviewer experience influences consumer trust. Karimi and Wang (2016) examined the effect of reviewer image, along with review depth, valence and equivocality, on review helpfulness; the results demonstrated that reviewer image can significantly enhance consumers’ evaluation of review helpfulness. An exploratory case study by Yang et al. (2016) investigated the comparative importance of reviewer location, reviewer level and helpful votes, review rating and length, and review photos for review helpfulness. The results reveal that review rating and reviewer helpful votes are the most important factors; however, this study did not consider reviewers’ social features, which may be more influential for review helpfulness. Ngo-Ye and Sinha (2014) introduced the idea of reviewer engagement (RFM) features to improve helpfulness prediction performance, proposing that a hybrid model (RFM and textual features) produces the best predictive results. However, they did not incorporate other significant review features that have experimentally proved to be better predictors, such as subjectivity, readability and meta-data features (Kim et al. 2006; Liu et al. 2007; Ghose and Ipeirotis 2011).

Zhang et al. (2014) presented an approach to estimate the degree of review helpfulness using helpfulness distribution and confidence interval features, performing experiments on real and synthetic datasets to validate the utility of the proposed method. Later, Krishnamoorthy (2015) presented the idea of extracting linguistic features from the textual content of reviews, demonstrating that linguistic features offer better predictive accuracy for review helpfulness and that a hybrid set of features produces the best prediction results. The relationship between review helpfulness, review sentiment and product type was exhaustively investigated by Chua and Banerjee (2016). They concluded that the variation of review helpfulness across review sentiment is independent of product type; furthermore, review helpfulness varies across information quality as a function of product type and review sentiment. Recently, a study investigated the impact of both quantitative and qualitative aspects of reviewers and built a conceptual model for helpfulness prediction (Qazi et al. 2016). The results suggest that the average number of concepts per sentence and the review type have varying degrees of impact on helpfulness.

Another study was conducted by Huang et al. (2015) to explore the impact of quantitative and qualitative factors of reviews and reviewers, such as reviewer impact, experience and cumulative helpfulness. The authors demonstrated that word count up to a certain threshold is effective for helpfulness prediction and that reviewer experience has a varying effect on helpfulness. Similarly, Liu and Park (2015) built a text regression model using a combination of review and reviewer features to predict review helpfulness; the feature set includes the readability and valence of reviews and the reviewer’s identity, expertise and reputation. Another study examined the moderating effect of product type and the impact of reviewer reputation, identity and depth on review helpfulness (Lee and Choeh 2016), revealing that reviewer reputation, review extremity and review depth are more important for helpfulness prediction for search goods. Recently, a first attempt was made to explore the order effect on review helpfulness from a social influence perspective (Zhou and Guo 2017). The findings reveal that the order of a review negatively relates to helpfulness, and this negative effect is inversely proportional to reviewer expertise. We believe these results are relevant in the context of the present work: the use of influential product, reviewer and review category features offers better predictive accuracy for the helpfulness prediction problem.

3 Model features

In this study, we introduce new review characteristics in three major categories, namely product, reviewer and review. The majority of the characteristics in each category are newly proposed in this study and play a vital role in helpfulness prediction. The hybrid features (product, reviewer, review) form the final feature matrix. The following subsections describe the three types of features in detail.

3.1 Review features

Review characteristics are the most important determinants for the helpfulness prediction of online reviews. We propose four new review features in this study: (1) cosine similarity between review text and product title, (2) number of comments posted on the review, (3) polarity and (4) sentiment score of the review text. In addition, the proposed features are combined with existing features from the literature (Lee and Choeh 2014; Chua and Banerjee 2016): (1) sentiment of a review in terms of review rating, (2) elapsed days since the posting date and (3) percentage of adjectives in the review text.

Cosine similarity between review text and title: The cosine similarity measure between review text and review title captures the similarity of the two texts by considering the angle between their vector representations rather than their magnitudes. The goal of this feature is to investigate to what extent the review title and review text are similar. In particular, by reading only the review title instead of the whole review text, a user can easily analyze and decide whether a review is helpful or not. Well-defined and comprehensive review titles provide summarized opinions and guidelines that enable online users to find the helpful reviews for a given product.

Polarity and sentiment score of review text: In the proposed study, the SentiStrength software (Thelwall et al. 2010) is used for the sentiment analysis of the review text. The scores of the two features (polarity and sentiment) are calculated by adapting the technique used by Stieglitz and Dang-Xuan (2013). The number of comments posted on a review can be obtained easily from Amazon.com. The sentiment of a review in terms of review rating is determined from its star rating: reviews with five-star and one-star ratings are termed favorable and unfavorable, respectively, whereas reviews with ratings from two to four stars are referred to as mixed. The mathematical expressions of the proposed features are given below:

$$ {\text{Sentiment}} = \left( {\text{Positive sentiment}} - {\text{Negative sentiment}} \right) - 2 $$
(1)
$$ {\text{Polarity}} = {\text{Positive sentiment}} + {\text{Negative sentiment}} $$
(2)
$$ \text{Cos}\left( x, y \right) = \frac{x \cdot y}{\left\| x \right\| \left\| y \right\|} = \frac{\sum_{i=0}^{n-1} x_i y_i}{\sqrt{\sum_{i=0}^{n-1} x_i^2}\,\sqrt{\sum_{i=0}^{n-1} y_i^2}} $$
(3)

where x and y are two vectors. Positive scores range from 1 (not positive) to 5 (extremely positive) and negative scores from − 1 (not negative) to − 5 (extremely negative).
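As an illustration, Eqs. (1)–(3) can be sketched as follows. This is a Python sketch (the study itself is implemented in R); it assumes the SentiStrength scores are already available as inputs and represents texts as simple term-frequency vectors:

```python
import math
from collections import Counter

def sentiment_scores(pos, neg):
    """Combine SentiStrength outputs (pos in 1..5, neg in -1..-5)
    into the two features of Eqs. (1) and (2)."""
    sentiment = (pos - neg) - 2   # Eq. (1): overall sentiment strength
    polarity = pos + neg          # Eq. (2): direction of the sentiment
    return sentiment, polarity

def cosine_similarity(text_a, text_b):
    """Eq. (3): cosine similarity between two texts, using
    term-frequency vectors over whitespace tokens (an assumption)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

For example, a review scored +3/−1 by SentiStrength yields a sentiment of 2 and a polarity of 2, and identical title and text give a cosine similarity of 1.0.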

3.2 Reviewer features

Previous studies demonstrated that reviewer characteristics are influential factors in predicting the helpfulness of online reviews; the reviewer’s rank and total number of reviews have proved to be significant predictors in the literature (Lee and Choeh 2014). In this study, we introduce two effective reviewer characteristics, ‘Reviewer Activity Length’ and ‘Reviewer Recency’, alongside ‘Reviewer Rank’ and ‘Total Number of Reviews’.

Reviewer activity length: It has been observed in the literature that the temporal dimension plays a significant role in predicting review helpfulness. Therefore, the reviewer activity length feature is proposed in this study, calculated as the number of days between the first and the last review written by the reviewer. It indicates how actively a reviewer participates in writing recent product reviews: a longer activity length corresponds to a higher number of potential reviews.

Reviewer recency: With the passage of time, helpful reviews accumulate. Therefore, the number of days since the reviewer’s first review becomes an important characteristic, and a positive correlation with temporal recency is expected: the longer ago a reviewer wrote his/her first review, the more helpful reviews he/she may have written. The mathematical expressions for computing the proposed features are given below:

$$ {\text{RV}}\_{\text{activity}}\_{\text{length}} = {\text{Time}}\;{\text{between}}\;{\text{first}}\;{\text{review}}\;{\text{and}}\;{\text{last}}\;{\text{review}}\;\left( {{\text{number}}\;{\text{of}}\;{\text{days}}} \right) $$
(4)
$$ {\text{RV}}\_{\text{recency}} = {\text{Active}}\;{\text{since}}\;{\text{first}}\;{\text{review}}\;\left( {{\text{number}}\;{\text{of}}\;{\text{days}}} \right) $$
(5)
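Eqs. (4) and (5) reduce to simple date arithmetic; a minimal Python sketch (the input format, a list of review dates per reviewer, is an assumption for illustration):

```python
from datetime import date

def reviewer_features(review_dates, today):
    """Eqs. (4) and (5): activity length and recency for one reviewer,
    given the dates of all his/her reviews (hypothetical input format)."""
    first, last = min(review_dates), max(review_dates)
    activity_length = (last - first).days   # Eq. (4): first to last review
    recency = (today - first).days          # Eq. (5): days since first review
    return activity_length, recency
```

For example, a reviewer active from 1 January to 30 June 2020, observed on 1 January 2021, has an activity length of 181 days and a recency of 366 days.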

3.3 Product features

Product features are important predictors for the degree of review helpfulness. The list of features under each category is presented in Table 1 along with their descriptions. The product features proposed in this study are: ‘average rating per review’, ‘number of questions answered’, ‘ratio of positive reviews’ and ‘ratio of critical reviews’ of the product, whereas two features (‘average review rating of the product over time’ and ‘elapsed time from the product release date’) are taken from prior studies (Lee and Choeh 2014). The mathematical descriptions of the proposed features are given below:

$$ {\text{Average rating per review}} = {\text{Average review rating of the product}}/{\text{total reviews}} $$
(6)
$$ {\text{ratio}}\_{\text{pos}} = {\text{Total number of positive reviews}}/{\text{total reviews}} $$
(7)
$$ {\text{ratio}}\_{\text{crit}} = {\text{Total number of critical reviews}}/{\text{total reviews}} $$
(8)

where the total numbers of positive and critical reviews of a product are obtained from Amazon.com, which provides these counts for every product. The total number of questions answered for each product is also obtained from Amazon.com.

Table 1 List of features

Ratio of positive reviews and ratio of critical reviews: these features are the ratios of positive and critical reviews to the total number of reviews, as described in Eqs. (7) and (8), and they also play a vital role in helpfulness prediction. In this study, these six features are selected on the basis of their impact in predicting the review helpfulness ratio. The influence of each product feature is discussed in detail in the experimental section.

Average rating per review is the potential score of the product and is computed using Eq. (6). A product with a large potential score has relatively few reviews, but those reviews carry high ratings: a high average review rating combined with a small number of total reviews yields a high potential score, because the small denominator (number of reviews) inflates the ratio. This feature is added to investigate the impact of this ratio on review helpfulness.
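The three product-level ratios of Eqs. (6)–(8) can be sketched in a few lines of Python (illustrative only; the counts of positive and critical reviews are the per-product totals provided by Amazon):

```python
def product_features(avg_rating, total_reviews, n_positive, n_critical):
    """Eqs. (6)-(8): product-level ratio features.
    avg_rating: average review rating of the product;
    n_positive / n_critical: Amazon's per-product review counts."""
    avg_rating_per_review = avg_rating / total_reviews   # Eq. (6)
    ratio_pos = n_positive / total_reviews               # Eq. (7)
    ratio_crit = n_critical / total_reviews              # Eq. (8)
    return avg_rating_per_review, ratio_pos, ratio_crit
```

For example, a product rated 4.5 on average across 90 reviews, 60 of them positive and 15 critical, gets a potential score of 0.05, a positive ratio of 2/3 and a critical ratio of 1/6.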

4 Methodology

4.1 Data collection

In this study, a real-life dataset is prepared for the experiments by crawling reviews from the amazon.com website. There are 40 different product categories available on amazon.com, of which 34 are considered in this study for crawling review data. We considered only the top-10 best sellers from each product category. The 34 categories were selected because their products contain reviews with at least two helpful votes; the products of the remaining 6 categories did not have the required number of helpful votes and were therefore dropped. Almost 1200 reviews were collected from each product category, giving an initial dataset of 40,800 reviews covering 3360 different products. A data cleaning process (Liu et al. 2008) is then applied in three steps: (1) duplicate reviews are identified and removed; (2) since reviews with many helpful votes but very low total votes are less useful, only reviews with at least ten total votes are retained; (3) reviews with blank text are removed. After data cleaning, we have 32,434 reviews from 3100 different products, as shown in Table 2.
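The three cleaning steps can be sketched as follows (a stdlib-only Python illustration; the record format and the field names `text` and `total_votes` are assumptions):

```python
def clean_reviews(reviews):
    """Three-step cleaning sketch over a list of review dicts:
    remove duplicates, require >= 10 total votes, drop blank texts."""
    seen, cleaned = set(), []
    for r in reviews:
        text = r["text"].strip()
        if not text:                 # step 3: blank review text
            continue
        if text in seen:             # step 1: duplicate review
            continue
        if r["total_votes"] < 10:    # step 2: too few total votes
            continue
        seen.add(text)
        cleaned.append(r)
    return cleaned
```

In practice the duplicate check would compare more than the raw text (e.g. reviewer and product IDs), but the filtering order and thresholds mirror the steps described above.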

Table 2 Dataset description

This is a novel dataset in that it considers product reviews from the majority of amazon.com product categories. The distribution of helpfulness values versus density/relative frequency for the prepared dataset is presented in Fig. 1. The helpfulness ratio (score) varies from 0 to 1 and is represented on the x-axis, whereas density/relative frequency is shown on the y-axis. Figure 1 shows the relative frequency of reviews at different helpfulness values; for example, a large proportion of reviews have helpfulness scores between 0.9 and 1, which indicates that the density of helpfulness is skewed to the right.

Fig. 1
figure 1

Distribution of helpfulness scores

4.2 Methods and performance metrics

In this research, the Ensemble method, multilayer perceptron with back propagation (MLP-BP), multivariate adaptive regression splines (MARS), generalized linear model (GLM) and classification and regression tree (CART) are used as learning methods. This is the first study to use an Ensemble method to build a predictive model for the helpfulness of online reviews. The R statistical programming language is used for training and testing all methods; it has built-in packages that provide interfaces for these algorithms. There are in total seventeen determinants related to product, reviewer and review characteristics. The performance of the learning methods is evaluated using standard error-based metrics: mean squared error (MSE), relative absolute error (RAE), root mean squared error (RMSE), root relative squared error (RRSE) and mean absolute error (MAE). The block diagram of the model adopted for review helpfulness prediction is presented in Fig. 2.
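The five evaluation metrics can be computed as below. This Python sketch assumes the standard (Weka-style) definitions, in which RAE and RRSE normalize by the errors of a mean-only predictor:

```python
import math

def error_metrics(actual, predicted):
    """MSE, RMSE, MAE, RAE and RRSE for a list of actual vs. predicted
    helpfulness values (standard definitions assumed)."""
    n = len(actual)
    mean_a = sum(actual) / n
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Relative metrics: compare against always predicting the mean.
    rae = sum(abs(e) for e in errors) / sum(abs(a - mean_a) for a in actual)
    rrse = math.sqrt(sum(e * e for e in errors)
                     / sum((a - mean_a) ** 2 for a in actual))
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "RAE": rae, "RRSE": rrse}
```

A model that only ever predicts the mean of the targets scores RAE = RRSE = 1; values below 1 indicate a model better than that baseline.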

Fig. 2
figure 2

Review helpfulness model

4.2.1 Ensemble model

Ensembling is the technique of combining two or more machine learning methods of similar or dissimilar classes; the combined methods are called base learners. The objective of combining is to build a more powerful predictive system that embeds the functionality of all base learners, and combining multiple classifiers has been shown to be useful in traditional machine learning (Ho et al. 1994). Put another way, it can be likened to a conference-room meeting among multiple traders deciding whether the price of a stock will go up: considering all of the traders’ opinions makes the final decision more accurate, unbiased and robust. Ensemble methods have already been utilized for the classification of social media data in a variety of contexts (Banerjee et al. 2015; Liu and Jansen 2017).

Three approaches are most commonly used to ensemble models: bagging, boosting and stacking. We adopt the stacking approach (Casas et al. 2018; Healey et al. 2018). In stacking, multiple layers of ML methods are designed one over another, where each method passes its predictions to the model in the layer above; finally, the top-layer method makes the final predictions based on the predictions of the lower-layer methods. The block diagram of the Ensemble model is presented in Fig. 3. Ordinary least squares regression, linear ridge regression and a gradient boosted machine are the bottom-layer methods; a random forest, as the top-layer method, takes the outputs of these three methods as inputs and predicts the final output.
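The stacked architecture described above (the paper implements it in R) can be sketched with scikit-learn's `StackingRegressor`; the synthetic data and hyperparameters here are illustrative assumptions, with 17 features mirroring the study's feature count:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression, Ridge

def build_ensemble():
    """Stacking sketch: OLS, ridge regression and a gradient boosted
    machine as bottom-layer learners, a random forest on top."""
    base = [("ols", LinearRegression()),
            ("ridge", Ridge(alpha=1.0)),
            ("gbm", GradientBoostingRegressor(random_state=0))]
    return StackingRegressor(estimators=base,
                             final_estimator=RandomForestRegressor(random_state=0))

# Illustrative run on synthetic data with 17 features.
X, y = make_regression(n_samples=200, n_features=17, noise=0.1, random_state=0)
model = build_ensemble().fit(X, y)
preds = model.predict(X[:5])
```

Internally, `StackingRegressor` fits the base learners with cross-validation and trains the final estimator on their out-of-fold predictions, which matches the layered architecture of Fig. 3.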

Fig. 3
figure 3

Ensemble model (block diagram)

4.3 Experimental results

A series of experiments is conducted in this section to examine the impact of the proposed variables on the degree of review helpfulness, and the performance of the proposed features is compared with state-of-the-art baseline features. The experiments comprise helpfulness prediction analysis, category-wise feature analysis, and the significance of each feature for review helpfulness.

4.3.1 Helpfulness prediction

In this section, two types of experiments are conducted. In the first set of experiments, the 32,434 instances of the dataset are randomly split into a training set of 31,421 samples and a test set of 1013 samples, following a variation of the v-fold cross-validation method: the dataset is randomly divided into 32 subsets, each of which in turn serves as the 1013-point test set while the remaining instances form the training set. Five machine learning methods are then trained and tested on these splits. The second set of experiments is conducted using the popular tenfold cross-validation method.

In the first set of experiments, five predictive models for the helpfulness of online reviews are constructed using the following ML methods: Ensemble, MLP-BP, MARS, GLM and CART. The models are trained and tested using the hybrid set of features (product, reviewer and review) on the crawled dataset. To compare the performance of the five ML methods, a v-fold cross-validation method is adopted in which the 32,434 instances of the dataset are randomly divided into 32 time-steps. For each time-step there are two subsets: one subset of 1013 instances as the test set and the remaining 31,421 as the training set (except for the 32nd time-step). The five regression methods (Ensemble, MLP-BP, MARS, GLM and CART) are then trained and tested for each time-step, yielding 32 test models for each regression method.
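The fold construction described above can be sketched generically: each block of indices serves once as the test set while the rest form the training set (a stdlib-only illustration; in the study the fold size is 1013 and the last fold differs slightly in size):

```python
def vfold_splits(n_samples, fold_size):
    """Yield (train, test) index lists: each consecutive block of
    `fold_size` indices is the test set once, the rest is training."""
    indices = list(range(n_samples))
    for start in range(0, n_samples, fold_size):
        test = indices[start:start + fold_size]
        train = indices[:start] + indices[start + fold_size:]
        yield train, test
```

In practice the indices would first be randomly shuffled, as the dataset is randomly divided into the 32 subsets.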

Each regression method produces predicted values of the target variable (the helpfulness of the test samples). These predictions are compared with the actual values, and RMSEs are computed for all 32 test models. The RMSE values for each regression model are presented in Fig. 4, which allows the predictive performance of the five regression algorithms to be examined across the 32 test models. The RMSEs generated by the Ensemble model are lower than those of the MLP-BP, MARS, GLM and CART methods across all test models; by combining multiple base learners, the Ensemble model outperforms the individual methods. In addition, the small RMSE values indicate the effectiveness of the proposed features.

Fig. 4

Prediction analysis using RMSE

The effect of the proposed features on the degree of review helpfulness is also compared with the baseline features (Lee and Choeh 2014), as shown in Fig. 5. The baseline utilized twenty features related to the product, reviewer and review categories; these twenty features are computed on the crawled dataset presented in this research. The Ensemble method is used to compare the predictive performance of the proposed and baseline indicators of online review helpfulness. Figure 5 clearly shows that the RMSEs obtained with the proposed set of features are significantly lower than those obtained with the baseline features for all 32 test sets. This indicates that the proposed features are more influential for review helpfulness than the baseline and contain more effective variables for predicting review helpfulness accurately.

Fig. 5

Comparison of proposed and baseline indicators (RMSE)

In the second experiment, the widely used tenfold cross validation technique is utilized to evaluate the predictive performance of the Ensemble, MLP-BP, MARS, GLM and CART regression methods with the proposed features. The baseline features (Lee and Choeh 2014) are again used for comparison. The predicted values are compared with the actual target values, and prediction errors are computed as MSE, RMSE, MAE, RRSE and RAE, as demonstrated in Table 3. All error measures show lower values for the proposed features with the Ensemble and MLP-BP learning methods than with the baseline features. Moreover, the Ensemble model trained on the proposed features shows the minimum errors (highlighted) compared with the MLP-BP, MARS, GLM and CART methods. This clearly demonstrates the utility of the proposed features over the baseline features, and our experimental results reveal that the proposed features (product, reviewer and review indicators) produce the best predictive performance for review helpfulness prediction.

Table 3 Predictive performance analysis
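The five error measures reported in Table 3 can be computed from predicted and actual helpfulness values as below. The RRSE and RAE formulas used here are the standard definitions (errors relative to a naive predict-the-mean baseline); the paper does not spell out its formulas, so this is an assumption.

```python
import numpy as np

def error_measures(actual, predicted):
    """Compute the five error measures reported in Table 3."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    resid = predicted - actual
    # Deviations of a naive predictor that always outputs the mean target.
    dev = actual - actual.mean()
    mse = np.mean(resid ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(resid)),
        "RRSE": np.sqrt(np.sum(resid ** 2) / np.sum(dev ** 2)),
        "RAE": np.sum(np.abs(resid)) / np.sum(np.abs(dev)),
    }

# Toy example: actual vs. predicted helpfulness ratios for four reviews.
measures = error_measures([0.2, 0.5, 0.8, 0.4], [0.25, 0.45, 0.7, 0.5])
print(measures)
```

RRSE and RAE are unitless: values below 1 mean the model beats the mean-only baseline, which makes them convenient for comparing feature sets across methods.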

4.3.2 Feature-wise analysis

Another set of experiments is conducted to investigate the influence of each feature category on the degree of review helpfulness using the Ensemble method. The feature categories are product, reviewer and review, as described in Table 4. In the first step, the Ensemble model is applied with each category of features (product, reviewer or review) as a standalone model for review helpfulness prediction. The results demonstrate that the reviewer features deliver the best standalone predictive performance, which shows that they are the most influential of the three categories for predicting review helpfulness; the errors obtained with the reviewer features are underlined in Table 4. In the second step, three pairwise combinations are constructed (product and review, reviewer and review, product and reviewer) and the Ensemble method is applied to each. The combination of reviewer and review features performs best among the three, and its results are underlined. It is also observed (Table 4) that the categories that perform best standalone yield the best-performing combinations. Finally, the hybrid set of all three categories (product, reviewer and review) delivers the best predictive accuracy (Table 3).

Table 4 Feature-wise performance analysis
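The category-wise protocol (each category standalone, then the three pairwise combinations) can be sketched as follows on synthetic data. The column groupings and GradientBoostingRegressor are illustrative stand-ins for the paper's actual features and Ensemble model.

```python
import itertools
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 400
# Hypothetical column groups standing in for the three feature categories.
groups = {
    "product": [0, 1, 2],
    "reviewer": [3, 4, 5],
    "review": [6, 7, 8],
}
X = rng.normal(size=(n, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=0.1, size=n)

results = {}
# Standalone categories plus every pairwise combination, as in Table 4.
for r in (1, 2):
    for combo in itertools.combinations(groups, r):
        cols = sum((groups[g] for g in combo), [])
        pred = cross_val_predict(GradientBoostingRegressor(random_state=0),
                                 X[:, cols], y, cv=5)
        results["+".join(combo)] = mean_squared_error(y, pred)

for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.4f}")
```

Sorting the six MSEs then shows which standalone category and which pair predict best, mirroring the comparisons underlined in Table 4.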

4.3.3 Feature importance

This section investigates the strength of each product, reviewer and review feature and its significance for review helpfulness prediction, with the Ensemble model as the learning method. The objective of this experiment is to probe the percentage contribution of each feature to the degree of review helpfulness. The importance of the features, based on mean square error, is presented in Fig. 6. Among the proposed reviewer features, reviewer activity length is the most effective determinant, followed by reviewer recency. This suggests that reviewers with long and recent activity tend to write reviews that attract more readership and receive more helpful votes. Overall, reviewer rank is the most effective determinant; the importance of the reviewer features was already highlighted in the previous experiment.

Fig. 6

MSE based importance of features

Among the review features, the number of comments attains the highest importance, with the sentiment and polarity of the review text next. The high importance of the number of comments reveals that customers prefer reviews that receive more comments, while the importance of sentiment and polarity shows that reviews containing more emotion and sentiment words attract relatively more helpful votes. The number of questions answered is the most important product-type indicator, as shown in Fig. 6, which indicates that customers prefer reviews of products that receive more answered questions. Average rating per review and ratio of positive reviews are the next most significant indicators; the importance of the ratio of positive reviews indicates that products receiving a large number of positive reviews influence customers and help them make purchase decisions. These results show that the features proposed in this study under the product, reviewer and review categories play a vital role in determining review helpfulness.

Initially, a larger set of features was considered. Through experiments such as the computation of feature importance and the feature-wise analysis, we found that the features proposed in Sect. 3 are the most influential for the helpfulness prediction problem. The variable importance measures reveal that the number of questions answered, the ratio of positive reviews and the average rating per review are the most influential product-type features, while reviewer activity length, reviewer recency, number of comments, and the sentiment and polarity of the review text are the most effective features in the reviewer and review categories. Thus the proposed features are effective indicators for the helpfulness prediction of online reviews.
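An MSE-based importance ranking of the kind shown in Fig. 6 can be approximated with permutation importance: shuffle one feature's column and measure how much the model's MSE increases. The sketch below uses synthetic data and hypothetical feature names; it illustrates the idea rather than the paper's actual computation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Hypothetical feature names standing in for the paper's predictors.
names = ["activity_length", "recency", "n_comments", "sentiment", "noise"]
n = 500
X = rng.normal(size=(n, len(names)))
# Make the first features matter most and the last one not at all.
y = (2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.2 * X[:, 3]
     + rng.normal(scale=0.1, size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Importance = increase in MSE when a feature's column is shuffled.
imp = permutation_importance(model, X_te, y_te,
                             scoring="neg_mean_squared_error",
                             n_repeats=10, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1]:
    print(f"{names[i]}: {imp.importances_mean[i]:.3f}")
```

Features whose shuffling barely changes the MSE (like `noise` here) contribute little and can be dropped, which is how an initially larger feature set can be pruned to the influential ones.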

4.4 Discussions

Online reviews are an effective promotional tool for e-commerce entities, especially given the information overload created for online buyers by the large amount of information available on the World Wide Web (Cao et al. 2011; Lee and Choeh 2014; Hu and Chen 2016). Online reviews provide a powerful, inexpensive and impactful channel through which online vendors and marketers can reach their customers; vendors therefore leverage the opinions of experienced customers to attract potential buyers. The "helpfulness" feature of online product reviews helps customers cope with the information overload problem and supports their decision-making process (Cao et al. 2011; Krishnamoorthy 2015).

Identification of influential predictors of online review helpfulness has attracted much interest in the literature; without attention to such variables, key insights may remain obscured. The current research contributes to this literature by providing a robust helpfulness prediction model that makes use of three different categories of influential features, namely product, reviewer and review. In particular, the contributions are robust and consistent across a real-world Amazon.com dataset using five performance metrics. Additionally, the findings extend existing research (Mudambi and Schuff 2010) by examining the reviewer features of online reviews (reviewer activity length, reviewer recency and reviewer rank) and analyzing the influence of each on online review helpfulness.

The proposed helpfulness prediction model is quite effective, achieving a minimum MSE of 0.019 on a real-life Amazon dataset, as presented in Table 3. The proposed product, reviewer and review features prove to be effective predictors that improve helpfulness prediction accuracy, based on the results in Fig. 4 and Tables 3 and 4, and the Ensemble ML model is shown to be more effective than the other four methods. The best predictive performance is obtained with the hybrid set of features, whose performance is also compared with the baseline features (Lee and Choeh 2014) in Fig. 5 and Table 3; the proposed features clearly outperform the baseline. Furthermore, a standalone model using reviewer features performs better than models using either review or product features, and a model combining reviewer and review features outperforms models combining product and reviewer or product and review features.

To identify effective determinants of review helpfulness, the importance of each feature is computed. The variable significance results clearly indicate that reviewer activity length, reviewer recency, number of comments on a review, sentiment and polarity of the review text, number of questions answered, ratio of positive reviews and average rating per review are the most significant parameters for determining review helpfulness, as presented in Fig. 6. Among product characteristics, the ratio of positive reviews has a stronger correlation with review helpfulness than the ratio of critical reviews or the average review rating of the product. A previous study (Lee and Choeh 2014) showed that products with extreme ratings received more helpful reviews; in addition, because positive reviews are more likely to be seen as helpful by consumers, the ratio of positive reviews has a strong relationship with helpfulness (Cao et al. 2011), which the present study verifies. Our findings also reveal that, across review sentiment in terms of rating, favorable reviews are not only the most common but also the most widely voted as helpful. Perhaps consumers look for confirming evidence when they browse reviews prior to making a purchase (Hammond et al. 1998); consequently, favorable reviews attract more endorsements, whereas unfavorable or mixed entries may remain largely ignored (Chen and Huang 2013).

Of the four reviewer factors analyzed in this study, reviewer activity length, reviewer recency and reviewer rank have a very strong relationship with helpfulness; in particular, reviewer activity length and reviewer recency turn out to be statistically more significant than the others. Although past performance may not always indicate future effects, our findings suggest there are traces of consistency in the review quality of experienced reviewers, in line with past studies (Ghose and Ipeirotis 2011; Ngo-Ye and Sinha 2014). Reviewer rank also plays a significant role in improving review helpfulness prediction. Since reviewer rank is measured by three factors (1) how many reviews the user has written, (2) how many helpful votes those reviews have received and (3) how recent the reviews are, it appears that consistently writing quality reviews translates into review helpfulness, similar to prior studies (Lee and Choeh 2014). However, the reviewer's 'total reviews' feature is not significantly correlated with review helpfulness, which shows that sheer volume of reviews does not translate into helpfulness.
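Amazon does not publish its exact ranking formula, but the interplay of the three factors can be illustrated with a toy scoring function; the weights and the recency transform below are purely hypothetical.

```python
def reviewer_rank_score(n_reviews, helpful_votes, days_since_last_review,
                        w_volume=1.0, w_votes=2.0, w_recency=1.5):
    """Toy score combining the three factors behind a reviewer rank:
    review volume, helpful votes received, and recency of activity."""
    recency = 1.0 / (1.0 + days_since_last_review)  # newer activity scores higher
    return (w_volume * n_reviews
            + w_votes * helpful_votes
            + w_recency * recency)

# A reviewer with many helpful votes and recent activity outranks one
# with more reviews but few votes and stale activity, consistent with the
# finding that volume alone does not translate into helpfulness.
a = reviewer_rank_score(n_reviews=50, helpful_votes=400, days_since_last_review=3)
b = reviewer_rank_score(n_reviews=120, helpful_votes=40, days_since_last_review=200)
print(a > b)
```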

Like all empirical studies, this study has some limitations. First, the data are collected from the Amazon.com review website. Amazon is a well-known online retailer that lets users leave feedback on any product they purchase, and other retailers such as Rakuten.com and Newegg.com offer similar review functionality on the web; customer reviews are also not limited to online retailers but extend to physical stores such as Wal-Mart and Best Buy. Amazon is thus one among several retailers that welcome online customer reviews, and generalizing results derived from Amazon data to the overall market could introduce bias; readers should be cautious about generalizing the results beyond the intended context. Nonetheless, the size of this organization and the volume of trade it processes have made it one of the largest online retailers, so the sample drawn from it still represents a large proportion of the online retailing population. Second, the data for testing the effectiveness of the proposed features were crawled from the top-10 best sellers of each product category. Accordingly, the results of this study apply most appropriately to products among the top-50 or top-100 best sellers of each category. Future studies can include more products or different brands to further strengthen the findings presented in this article.

5 Implications

The findings of this research have several implications for research in this field. Prior work on the helpfulness of online reviews has adopted different approaches and produced valuable but diverse, sometimes inconsistent findings; the main idea of our research is to present an effective model for review helpfulness prediction. To our knowledge, this is the first study to utilize an Ensemble model for online review helpfulness prediction, besides proposing effective features that play a vital role in helpfulness prediction, as shown in the detailed experimentation. Ensemble models, which combine robust ML algorithms, are more appropriate than individual approaches for analyzing complex relationships among determinants because they can capture both linear and non-linear patterns in the data. Ensembling has also grown in popularity as more organizations deploy the computing resources and advanced analytics tools needed to run such models.

The Ensemble model delivers the best results compared with conventional machine learning models; it is therefore used to explore the impact of the influential variables and to recommend the best combination of them. This study demonstrates that reviewer, review and product characteristics have varying impacts on the helpfulness of online reviews, and proposes several effective features in these categories that have not previously been used in the literature. The number of comments and the sentiment and polarity scores of the review text are important review characteristics for the helpfulness prediction model; reviewer characteristics such as reviewer activity length, reviewer recency and reviewer rank are the most effective determinants; and the number of questions answered, the ratio of positive reviews and the average rating per review are influential product-type features. Overall, this study shows that review helpfulness is a complex construct. These ideas can be effectively applied in related fields such as opinion mining, review summarization, recommendation and sentiment analysis.

A number of practical implications may be derived from this study. First, the developed review helpfulness prediction model can serve as a guideline for building a smart review recommendation system for product websites. Specifically, when a consumer browses the reviews of a selected product, the system can automatically identify useful reviews according to the sentiment and polarity of each review and the rating profile of the target product. This is a highly desirable capability, as product websites will be able to offer a deeper level of adaptive filtering. Because online shoppers usually have limited time to process a large number of product reviews, such a system can help users quickly grasp the important information about selected products and save time during online shopping by organizing product reviews more effectively.
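Such a recommendation system could surface reviews as sketched below, assuming a model already trained on historical (features, helpfulness) pairs; the four-feature representation, the data and the model choice are all illustrative assumptions, not a prescribed design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)

# Pretend history: feature vectors (e.g. sentiment, polarity, rating, comments)
# paired with observed helpfulness, used to train the predictor once, offline.
X_hist = rng.normal(size=(200, 4))
y_hist = X_hist @ np.array([0.6, 0.3, 0.8, 0.1]) + rng.normal(scale=0.1, size=200)
model = GradientBoostingRegressor(random_state=0).fit(X_hist, y_hist)

# At browse time: feature vectors for the 8 reviews of the selected product.
review_features = rng.normal(size=(8, 4))

# Rank the product's reviews by predicted helpfulness; show the top 3 first.
scores = model.predict(review_features)
top3 = np.argsort(scores)[::-1][:3]
print(top3)
```

The design point is that prediction replaces waiting for votes: newly posted reviews with no votes yet can still be ranked and surfaced immediately.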

This research also exposes a potential loophole through which users can enhance their reputation in the community. Given the positivity bias, customers who consistently post favorable reviews would accumulate helpful votes irrespective of information quality, providing an unfair shortcut to building repute. As users join the bandwagon of submitting favorable reviews merely to establish their reputation, unfavorable and mixed entries may be pushed toward extinction; such a trend may already be under way, as favorable reviews are more prevalent than unfavorable or mixed ones (Hu et al. 2009). This is undesirable because diversity of opinion is the key strength of user-generated content (Kaplan and Haenlein 2010). Review websites could therefore encourage consumers to submit reviews of different sentiment, and consumers could be encouraged to browse a variety of reviews rather than merely looking for favorable ones before making a purchase decision.

6 Conclusions

This study addresses the problem of helpfulness prediction for online reviews and builds an effective predictive model using five machine learning methods, including an Ensemble model. Influential features in three categories (product, reviewer and review) are proposed, and the hybrid set of features delivers the best predictive results. The performance of the Ensemble model is better than that of the MLP-BP, MARS, GLM and CART regression models. The predictive performance of the proposed features is also compared with the baseline features using the same dataset and regression method; experimental results show that the proposed features outperform the baseline features for predicting review helpfulness across various evaluation metrics. A category-wise feature analysis shows that reviewer and review features are the best standalone predictors of online review helpfulness. In addition, the importance of each feature is examined and the influential features in each category are highlighted: the number of questions answered, the ratio of positive reviews and the average rating per review are the most influential product-type features, while reviewer activity length, reviewer recency, number of comments, and the sentiment and polarity of the review text are the most effective reviewer and review features. Thus the proposed features are effective indicators for the helpfulness prediction of online reviews.

This research can be extended in multiple directions. Future work will focus on the use of semantic indicators, reviewer identity and social features to investigate their impact on review helpfulness. The effective reviewer and review features can also be applied in other settings, for example to identify and rank influential reviewers on online review platforms, and effective product rankings can be computed on the basis of the proposed features. A further extension is to apply the proposed variables in other domains such as recommendation systems, text summarization and spam filtering.