Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection

Heredia, Brian; Khoshgoftaar, Taghi M.; Prusa, Joseph D.; Crawford, Michael

doi:10.1007/s13278-017-0456-z

Original Article
Published: 04 August 2017

Volume 7, article number 37, (2017)
Social Network Analysis and Mining

Brian Heredia ORCID: orcid.org/0000-0002-9946-9022¹,
Taghi M. Khoshgoftaar¹,
Joseph D. Prusa¹ &
…
Michael Crawford¹

552 Accesses
10 Citations

Abstract

As fake reviews become more prevalent on the web, methods for distinguishing untruthful reviews from truthful ones become increasingly necessary. Detecting false reviews is difficult, however, because judging the validity of a review from its text alone can be nearly impossible for a human. In this study, we evaluate the effectiveness of machine learning techniques for detecting spam reviews, specifically ensemble techniques and the combination of feature selection with ensemble techniques. In addition to traditional ensemble techniques, such as Boosting and Bagging, we employ techniques that combine ensemble methods with a form of feature selection: Select-Boost, Select-Bagging and Random Forest. For Select-Boost and Select-Bagging, we combine the Boosting and Bagging approaches with three different feature rankers. Random Forest was run with 100, 250, and 500 trees. Our results show that Select-Boost combined with multinomial naïve Bayes and either Chi-squared or signal-to-noise feature ranking significantly outperforms all methods except Random Forest with 500 trees. When Select-Boost is used with multinomial naïve Bayes, there is no significant difference between the feature subset sizes tested, regardless of the feature selection technique employed. To the best of our knowledge, this is the first study to examine the effect of combining ensemble techniques and feature selection in the domain of spam review detection.
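
The idea of pairing a feature ranker with a bagged learner can be illustrated with a small sketch. This is not the authors' implementation: it is a standard-library Python approximation that assumes the Chi-squared ranking is recomputed on each bootstrap sample before a multinomial naïve Bayes model is trained on the top-ranked terms. The toy corpus, the function names (`chi2_scores`, `train_mnb`, `select_bagging`), and the parameters `k` and `n_rounds` are all hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus: label 1 = deceptive, 0 = truthful.
TEXTS = ["amazing amazing best hotel", "best amazing stay hotel",
         "room small hotel", "small room clean hotel"]
LABELS = [1, 1, 0, 0]
DOCS = [Counter(text.split()) for text in TEXTS]

def chi2_scores(docs, labels, vocab):
    """Chi-squared score of each term's presence against a binary label."""
    n, n_pos = len(docs), sum(labels)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)
        c, d = n_pos - a, (n - n_pos) - b  # term absent in class 1 / class 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def train_mnb(docs, labels, features):
    """Multinomial naive Bayes with add-one smoothing, restricted to `features`."""
    model = {}
    for cls in (0, 1):
        counts = Counter()
        for doc, y in zip(docs, labels):
            if y == cls:
                for term in features:
                    counts[term] += doc[term]
        total = sum(counts.values())
        # Smoothed prior, so a bootstrap sample missing one class still works.
        log_prior = math.log((labels.count(cls) + 1) / (len(labels) + 2))
        log_lik = {t: math.log((counts[t] + 1) / (total + len(features)))
                   for t in features}
        model[cls] = (log_prior, log_lik)
    return model

def predict_mnb(model, doc):
    """Return the class with the higher log posterior for a token Counter."""
    def score(cls):
        log_prior, log_lik = model[cls]
        return log_prior + sum(doc[t] * lp for t, lp in log_lik.items())
    return max((0, 1), key=score)

def select_bagging(docs, labels, k=2, n_rounds=11, seed=0):
    """Assumed scheme: rank features and train one model per bootstrap sample."""
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    ensemble = []
    for _ in range(n_rounds):
        idx = [rng.randrange(len(docs)) for _ in docs]  # sample with replacement
        s_docs = [docs[i] for i in idx]
        s_labels = [labels[i] for i in idx]
        scores = chi2_scores(s_docs, s_labels, vocab)
        top = sorted(vocab, key=lambda t: -scores[t])[:k]  # keep k best terms
        ensemble.append(train_mnb(s_docs, s_labels, top))
    return ensemble

def predict(ensemble, doc):
    """Majority vote over the ensemble members (use an odd n_rounds)."""
    votes = sum(predict_mnb(m, doc) for m in ensemble)
    return int(2 * votes > len(ensemble))

ensemble = select_bagging(DOCS, LABELS)
print(predict(ensemble, Counter("amazing best stay".split())))
```

A boosting variant would replace the uniform bootstrap with instance weights that are updated after each round. The corpus here is far too small to say anything about accuracy; it only shows how the ranking and ensemble steps fit together.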



Acknowledgements

The authors would like to thank the anonymous reviewers and the Editor for their constructive evaluation of this paper, and the members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for assistance with the reviews. We also acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are those of the authors and do not reflect the views of the NSF.

Author information

Authors and Affiliations

Florida Atlantic University, Boca Raton, FL, USA
Brian Heredia, Taghi M. Khoshgoftaar, Joseph D. Prusa & Michael Crawford

Corresponding author

Correspondence to Brian Heredia.

About this article

Cite this article

Heredia, B., Khoshgoftaar, T.M., Prusa, J.D. et al. Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection. Soc. Netw. Anal. Min. 7, 37 (2017). https://doi.org/10.1007/s13278-017-0456-z

Received: 27 March 2017
Revised: 18 July 2017
Accepted: 20 July 2017
Published: 04 August 2017
DOI: https://doi.org/10.1007/s13278-017-0456-z
