Abstract
As fake reviews become more prominent on the web, a method to differentiate between untruthful and truthful reviews becomes increasingly necessary. However, detecting false reviews is difficult, as determining the validity of a review based solely on its text can be nearly impossible for a human. In this study, we aim to determine the effectiveness of machine learning techniques, specifically ensemble techniques and the combination of feature selection and ensemble techniques, for the detection of spam reviews. In addition to traditional ensemble techniques, such as Boosting and Bagging, we employ techniques that combine ensemble methods with a form of feature selection: Select-Boost, Select-Bagging and Random Forest. For Select-Boost and Select-Bagging, we combine the Boosting and Bagging approaches with three different feature rankers. Random Forest was run with 100, 250, and 500 trees. Our results show that Select-Boost combined with multinomial naïve Bayes and either the Chi-squared or the signal-to-noise feature ranker significantly outperforms all other methods except Random Forest using 500 trees. There is no significant difference between the feature subset sizes tested when using Select-Boost with multinomial naïve Bayes, regardless of the feature selection technique employed. To the best of our knowledge, this is the first study to examine the effect of a combination of ensemble techniques and feature selection in the domain of spam review detection.
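The combination of a feature ranker with an ensemble method can be illustrated with a minimal sketch of the Select-Bagging idea. It assumes that feature ranking is rerun on every bootstrap sample before the base learner is trained; the abstract does not spell out this detail, so treat it as an assumption. The toy word-count data, the function names, the subset size k and the ensemble size are all hypothetical and are not taken from the study.

```python
import math
import random

def chi2_scores(X, y, n_features):
    """Chi-squared score of each feature's presence against a binary label."""
    n, pos = len(X), sum(y)
    neg = n - pos
    scores = []
    for j in range(n_features):
        a = sum(1 for row, lab in zip(X, y) if row[j] > 0 and lab == 1)  # present, spam
        b = sum(1 for row, lab in zip(X, y) if row[j] > 0 and lab == 0)  # present, truthful
        c, d = pos - a, neg - b                                          # absent counts
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores.append(n * (a * d - b * c) ** 2 / denom if denom else 0.0)
    return scores

def train_mnb(X, y, features):
    """Multinomial naive Bayes with add-one smoothing on the chosen features."""
    model = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        counts = [sum(row[j] for row in rows) for j in features]
        total = sum(counts) + len(features)
        log_prior = math.log((len(rows) + 1) / (len(X) + 2))
        model[label] = (log_prior, [math.log((c + 1) / total) for c in counts])
    return model

def predict_mnb(model, features, row):
    """Return 1 (spam) if the spam log-score is strictly higher, else 0."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(row[j] * w for j, w in zip(features, log_lik))
    return 1 if score(1) > score(0) else 0

def select_bagging(X, y, n_models=10, k=2, seed=0):
    """Assumed scheme: per bootstrap sample, rank features, keep top k, train."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    ensemble = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]       # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        scores = chi2_scores(Xb, yb, n_features)
        top = sorted(range(n_features), key=lambda j: -scores[j])[:k]
        ensemble.append((top, train_mnb(Xb, yb, top)))
    return ensemble

def predict(ensemble, row):
    """Majority vote over the ensemble members."""
    votes = sum(predict_mnb(model, feats, row) for feats, model in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

# Hypothetical word counts over a four-term vocabulary; label 1 = spam.
X = [[3, 0, 1, 1], [2, 0, 0, 1], [4, 0, 1, 0], [3, 0, 1, 1],
     [0, 3, 1, 1], [0, 2, 0, 1], [0, 4, 1, 0], [0, 3, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

ensemble = select_bagging(X, y)
print(predict(ensemble, [3, 0, 1, 0]))
print(predict(ensemble, [0, 3, 0, 1]))
```

Under the same assumption, a Select-Boost variant would replace the uniform bootstrap draw with sampling according to boosting weights that are updated after each round. Swapping in another ranker, such as signal-to-noise, only requires replacing `chi2_scores`.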
Acknowledgements
The authors would like to thank the anonymous reviewers and the Editor for their constructive evaluation of this paper, and the members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews. We also acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors' and do not reflect the views of the NSF.
About this article
Cite this article
Heredia, B., Khoshgoftaar, T.M., Prusa, J.D. et al. Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection. Soc. Netw. Anal. Min. 7, 37 (2017). https://doi.org/10.1007/s13278-017-0456-z
DOI: https://doi.org/10.1007/s13278-017-0456-z