
Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection

  • Original Article
  • Social Network Analysis and Mining

Abstract

As fake reviews become more prominent on the web, a method to differentiate between untruthful and truthful reviews becomes increasingly necessary. However, detecting false reviews is difficult, as determining the validity of a review based solely on its text can be nearly impossible for a human. In this study, we evaluate the effectiveness of machine learning techniques for the detection of spam reviews, specifically ensemble techniques and the combination of feature selection with ensemble techniques. In addition to traditional ensemble techniques, such as Boosting and Bagging, we employ techniques that combine ensemble methods with a form of feature selection: Select-Boost, Select-Bagging and Random Forest. For Select-Boost and Select-Bagging, we combine the Boosting and Bagging approaches with three different feature rankers. Random Forest was run with 100, 250, and 500 trees. Our results show that Select-Boost with multinomial naïve Bayes and either the Chi-squared or the signal-to-noise ranker significantly outperforms all other methods except Random Forest with 500 trees. There is no significant difference between the feature subset sizes tested when using Select-Boost with multinomial naïve Bayes, regardless of the feature selection technique employed. To the best of our knowledge, this is the first study to examine the effect of combining ensemble techniques and feature selection in the domain of spam review detection.
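To make the feature-ranking step concrete, the following is a minimal sketch of a Chi-squared ranker followed by top-k feature selection. It is an illustration only, not the authors' implementation: the whitespace tokenizer, the binary presence/absence features, the toy reviews and the function names (`chi_squared_scores`, `top_k_features`) are all assumptions introduced here. In a Select-Boost style setup, a ranking step of this kind would presumably be applied before the base learner is trained; the abstract does not specify the details.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (hypothetical, for illustration).
    return text.lower().split()


def chi_squared_scores(docs, labels):
    """Score each term by the 2x2 chi-squared statistic of
    term presence/absence against a binary class label (1 or 0)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for term in set(tokenize(doc)):  # document frequency, not raw counts
            (pos_df if y == 1 else neg_df)[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # class-1 documents containing the term
        b = neg_df[term]   # class-0 documents containing the term
        c = n_pos - a      # class-1 documents without the term
        d = n_neg - b      # class-0 documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(scores, k):
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data (invented for this sketch): label 1 = spam review, 0 = truthful review.
reviews = [
    "amazing hotel my husband loved it",
    "amazing stay my family loved it",
    "room was small but clean",
    "room was noisy and small",
]
labels = [1, 1, 0, 0]

scores = chi_squared_scores(reviews, labels)
selected = top_k_features(scores, 3)
print(selected)
```

Terms that occur in every review of one class and in none of the other receive the highest score, while terms seen in only a single review rank lower. The subset size `k` corresponds to the feature subset sizes mentioned above; the values tested in the study are not given in the abstract.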



Acknowledgements

The authors would like to thank the anonymous reviewers and the Editor for their constructive evaluation of this paper, and the members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews. We also acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are those of the authors and do not reflect the views of the NSF.

Author information

Corresponding author

Correspondence to Brian Heredia.

About this article

Cite this article

Heredia, B., Khoshgoftaar, T.M., Prusa, J.D. et al. Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection. Soc. Netw. Anal. Min. 7, 37 (2017). https://doi.org/10.1007/s13278-017-0456-z

  • DOI: https://doi.org/10.1007/s13278-017-0456-z
