An SVD-Entropy and bilinearity based product ranking algorithm using heterogeneous data

doi:10.1016/j.jvlc.2017.06.001

Journal of Visual Languages & Computing

Volume 41, August 2017, Pages 133-141

https://doi.org/10.1016/j.jvlc.2017.06.001

Abstract

E-commerce websites, besides selling products and services, place considerable emphasis on providing a platform for consumers to share their opinions about past and potential purchases. Consumers share these opinions as product reviews (star ratings, plain text, etc.) and as answers to product-related questions (Q&A data). Several machine learning and classification approaches are available to scrutinize this review data, e.g., algorithms based on Entropy measures, Bilinear Similarity, and stochastic methods. In this paper, we review some of the prevalent review classification techniques and present a hybrid approach, involving Singular Value Decomposition (SVD), Entropy, and Bilinear Similarity measures, that uses heterogeneous product data to simultaneously analyze and rank products for customers. With experimental results, we show that our approach effectively ranks products using (1) text reviews, (2) Q&A data, and (3) five-star ratings of products, and that it improves prediction accuracy by 10% compared to the individual approaches. Also, using SVD, we achieve a 35% improvement in running time for our algorithm while sacrificing only 1% of the prediction accuracy.

Introduction

Product review analysis has emerged as a novel field of research with valuable implications in the real world. Many e-commerce websites, such as Amazon.com, Angie's List, Google, and Yelp, provide a platform for consumers to share their opinions. In 1995, Amazon was one of the first online stores to allow customers to post reviews for products. Amazon remains one of the most valuable resources for consumers making informed purchasing decisions [1]. Similarly, Yelp is a free review site where consumers can rate businesses on a five-star scale. Unbiased reviews by customers give new consumers the confidence to go ahead with their transactions [2]. Reviews can be quantitative, in the form of a five-star rating (0–5), or qualitative, in the form of plain text comments or textual answers to questions. However, with the ever-growing volume of available review data and the limited time and attention span of customers, researchers are focusing on advanced mining methods to ascertain the quality, authenticity, and usefulness of reviews. The goal is to find information buried in the noisy data so as to help the customer make an informed decision. Existing data mining techniques sometimes lack the novelty and flexibility to work well with new-age applications, and there is a growing need to combine the features of the various available algorithms into advanced practical methods.

A five-star product rating (or any numerical scale, such as 0–5 or 0–10) alone does not convey the sentiment of a review the way the opinion text does [3], [4]. Hence, the current trend is to use more than one type of consumer sentiment in surveys about products and services. However, although quantitative ratings are quick and easy to process, qualitative information is vague, and deciphering and evaluating text reviews is time-consuming because natural language text is unstructured [2]. In fact, this is a separate area of Information Retrieval and Natural Language Processing, in which character-level text similarity and semantic similarity are used.

Reviews remain a primary source of information for consumers, producers, and marketers [5]. However, reviews are subjective opinions and judgments about a product or a service. The availability of search and evaluation tools makes consumers knowledgeable enough, and comfortable enough, to distinguish between forged and genuine reviews based on their implied reputation [6]. Fortunately, advanced techniques are available to scrutinize reviews for the validity of their information [7], such as relying on reliable, loyal, highly ranked reviewers to authenticate the quality of reviews by factoring in the reviews they have written.

There are several approaches to ascertain the quality, authenticity, and usefulness of qualitative and quantitative reviews: Entropy, Decision trees, SVD, Information Retrieval, Natural Language Processing, Support Vector Machines, Machine Learning, Stochastic probability, etc. These approaches focus on (1) evaluating the helpfulness of the review data, (2) ranking products based on the relevance and ratings from the reviews, and (3) reducing the consumer search and discovery time. However, no single algorithm can achieve all three. In this paper, we design an improved algorithm based on a combination of mining and classification approaches to analyze the review data and help consumers determine the good and bad features of a product before they buy it.

In this paper, we use heterogeneous Amazon review data for our experiments, the same data as in [8]. However, in the previous work, most of the results were theoretical. Here, we provide the results of running the individual and hybrid algorithms on the datasets, highlighting the prediction accuracy of the new algorithm. Also, since the datasets are large, processing is time-consuming, so we apply SVD to the data before processing. In the interest of efficiency and effectiveness, the data size is reduced while the prediction accuracy is kept nearly intact. We discuss these merits in detail later in the paper.

The paper is organized as follows: Section 2 describes the background and reviews the literature on SVD-Entropy and Bilinear similarity. Section 3 describes our contribution, a hybrid algorithm designed to surpass the performance and accuracy of the individual techniques. Section 4 presents experiments that confirm the hybrid algorithm performs as expected, and Section 5 concludes with a summary.

Section snippets

Preliminaries

Data analysis starts with a set of foundations adapted from several fields, including statistics, mathematics, social sciences, natural sciences, and computer science. There are four stages of data analysis: data collection, data preparation/transformation, selection of data analysis algorithms, and putting the algorithms into practice [9].

Here we describe the standard terms used in this work and discussion [10]. A data instance (object) is represented as a vector. We assume all vectors are column vectors. A matrix

Hybrid adaptive algorithm

Our objective is to develop an adaptive hybrid algorithm that uses a combination of existing techniques to rate products consistently based on consumer sentiment. We combine many of the approaches surveyed above into a single adaptive hybrid algorithm to rank products. We use the following heterogeneous information related to a product as inputs to our approach. Once the quantitative rating is computed, the ranking is just an indexing of products based on the quantitative measures. The
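The final step described above, ranking as an indexing of products by their quantitative rating, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name rank_products, the product identifiers, and the scores are all hypothetical, and the tie-breaking rule is an assumption made only to keep the output deterministic.

```python
from typing import Dict, List, Tuple


def rank_products(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, product_id, score) tuples, best score first.

    `scores` maps a product id to its already-computed quantitative rating.
    """
    # Sort by descending score; ties are broken by product id (an assumption,
    # used here only so that the ordering is deterministic).
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, pid, score) for rank, (pid, score) in enumerate(ordered, start=1)]


# Hypothetical quantitative ratings, one per product.
example_scores = {"product_a": 3.9, "product_b": 4.6, "product_c": 4.1}
ranking = rank_products(example_scores)
for rank, pid, score in ranking:
    print(rank, pid, score)
```

Because the ranking depends only on the final score, any change to how the heterogeneous inputs are combined into that score leaves this step untouched.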

Empirical evaluation and discussion

We start by reducing the product review data space with SVD and then calculate the Entropy measure of the transformed product reviews. The SVD transformation reduced the data by a factor of 3, yielding a 35% improvement in overall running time. Using the Entropy-based classification model explained in Section 2.4, we calculate Score_i for the reviews of each product i. We use k-fold cross validation, with k = 10, to optimize the model parameters to make the model fit the training data
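The first two steps of this pipeline, SVD-based reduction followed by an entropy measure, can be sketched as below. This is a rough illustration under stated assumptions rather than the procedure used in the experiments: the review matrix is random toy data, the choice of 9 features reduced to 3 merely mimics a factor-of-3 reduction, and svd_entropy uses one common definition of SVD entropy (the normalized Shannon entropy of the squared singular values), which may differ from the Score_i computation of Section 2.4. The helper names svd_reduce and svd_entropy are hypothetical.

```python
import numpy as np


def svd_reduce(X: np.ndarray, k: int):
    """Project the rows of X onto its top-k right singular vectors.

    Returns the reduced matrix (n_rows x k) and all singular values of X.
    """
    # Thin SVD: X = U @ diag(s) @ Vt, with s sorted in descending order.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_reduced = X @ Vt[:k].T
    return X_reduced, s


def svd_entropy(singular_values: np.ndarray) -> float:
    """Normalized Shannon entropy of the squared singular values, in [0, 1]."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    total = energy.sum()
    if total == 0.0 or energy.size < 2:
        return 0.0
    p = energy / total
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum() / np.log(energy.size))


# Toy stand-in for a review feature matrix: 12 reviews, 9 features (assumed).
rng = np.random.default_rng(0)
X = rng.random((12, 9))

X_reduced, s = svd_reduce(X, k=3)  # 9 columns -> 3 columns
print(X_reduced.shape)
print(svd_entropy(s))
```

An entropy near 0 means a single singular direction carries almost all of the variation, so an aggressive reduction loses little; an entropy near 1 means the variation is spread evenly and truncation discards more information.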

Conclusion

Several e-commerce websites provide a platform for consumers to share their opinions. In this paper, we developed an algorithm to process heterogeneous survey data for ranking consumer products. The data consists of three diverse representations: five-star ratings, Q&A text, and review text. Our hybrid approach takes the best features of the analysis methods: two-pronged similarities in terms of Entropy and Bilinear similarity. We have used three diverse categories of Amazon.com data: k-core

References (29)

M. Banerjee et al.
Feature selection with SVD Entropy: some modification and extension
Inf. Sci.
(2014)
L. Kolowich, [Online] 19 Online Review Sites for Collecting, Business & Product Reviews, Hubspot, September 17,...
G. Lackermair et al.
Importance of online product reviews from a consumer's perspective
Adv. Econ. Bus.
(2013)
P.D. Turney
Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews
B. Liu
Sentiment Analysis and Opinion Mining
(2012)
[Online] A. Haghighi, How to approach machine learning as a non-technical person. Crunch Network, Apr 2, 2016....
D.H. Park et al.
The effect of on-line consumer reviews on consumer purchasing intention: the moderating role of involvement
Int. J. Electron. Commer.
(2007)
L. Dragon
Let's talk about Amazon reviews: how we spot the fakes
(2016)
B. Anjum et al.
An entropy based product ranking algorithm using reviews and Q&A data
L. Getoor (Chair), D. Culler, E. Sturler, D. Ebert, M. Franklin, and H.V. Jagadish on behalf of the CRA Board,...
J. Hefferon. [Online] Linear Algebra. http://joshua.smcvt.edu/linearalgebra,...
L. Zhang et al.
Generalizing matrix factorization through flexible regression priors
C.D. Manning et al.
An Introduction To Information Retrieval
(2009)
H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: separating facts from opinions and identifying the...
