Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection

Electronic Commerce Research

Abstract

With the rise of the social web, concern has grown about the quality of user-generated content on social media sites (SMSs). Deceptive comments erode users’ trust in online social media and cause financial losses to firms. Previous studies have used various features and classification algorithms to detect and filter social spam on several social media platforms. However, to the best of our knowledge, none has combined probabilistic topic modeling with incremental learning to detect social spam on SMSs. The main contribution of this paper is therefore the design of a novel detection methodology that combines topic- and user-based features to improve the effectiveness of social spam detection. The proposed methodology uses a probabilistic generative model, labeled latent Dirichlet allocation (L-LDA), to mine the latent semantics of user-generated comments, and an incremental learning approach to handle the changing feature space. An experiment on a large dataset extracted from YouTube demonstrates the effectiveness of the proposed methodology, which achieves an average accuracy of 91.17% in social spam detection. Our statistical analysis shows that topic-based features significantly improve social spam detection, a finding with significant implications for business practice.
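The idea of incrementally learning over a feature space that changes can be illustrated with a minimal sketch. It is not the paper's pipeline: the topic proportions below are hand-written stand-ins for what a topic model such as L-LDA might output, the user feature `links_per_comment` is hypothetical, and a plain online perceptron is swapped in as one simple incremental learner. Weights are stored by feature name, so a topic that first appears late in the stream simply gets a new weight without retraining from scratch.

```python
class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse, named features."""

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; unseen names count as 0.0
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )

    def predict(self, features):
        # +1 = spam, -1 = legitimate
        return 1 if self.score(features) > 0 else -1

    def partial_fit(self, features, label):
        """Update on a single example; return True if the weights changed."""
        if self.predict(features) == label:
            return False
        for name, value in features.items():
            updated = self.weights.get(name, 0.0) + self.learning_rate * label * value
            self.weights[name] = updated
        self.bias += self.learning_rate * label
        return True


def combine(topic_features, user_features):
    """Merge topic- and user-based features into one namespaced sparse vector."""
    merged = {f"topic:{name}": value for name, value in topic_features.items()}
    merged.update({f"user:{name}": value for name, value in user_features.items()})
    return merged


# Hypothetical comment stream: (features, label), processed one at a time.
stream = [
    (combine({"promo": 0.9, "music": 0.1}, {"links_per_comment": 1.0}), 1),
    (combine({"promo": 0.1, "music": 0.9}, {"links_per_comment": 0.0}), -1),
    # A later comment introduces a topic the model has never seen.
    (combine({"giveaway": 0.8, "music": 0.2}, {"links_per_comment": 0.0}), 1),
]

model = OnlinePerceptron()
for features, label in stream:
    model.partial_fit(features, label)
```

After the third comment, `model.weights` contains an entry for `topic:giveaway` even though that feature did not exist when training began. This is the property the sketch is meant to show. With only three toy examples the learned weights carry no evidential value.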

Acknowledgments

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project: CityU 11502115), and the Shenzhen Municipal Science and Technology R&D Funding - Basic Research Program (Project No. JCYJ20140419115614350).

Author information

Corresponding author

Correspondence to Long Song.

Cite this article

Song, L., Lau, R.Y.K., Kwok, R.CW. et al. Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection. Electron Commer Res 17, 51–81 (2017). https://doi.org/10.1007/s10660-016-9244-5

  • DOI: https://doi.org/10.1007/s10660-016-9244-5
