Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection

Song, Long; Lau, Raymond Yiu Keung; Kwok, Ron Chi-Wai; Mirkovski, Kristijan; Dou, Wenyu

doi:10.1007/s10660-016-9244-5

Published: 08 October 2016

Electronic Commerce Research, volume 17, pages 51–81 (2017)

Long Song¹ (ORCID: orcid.org/0000-0001-9874-8188),
Raymond Yiu Keung Lau¹,
Ron Chi-Wai Kwok¹,
Kristijan Mirkovski² &
Wenyu Dou³

1973 Accesses
24 Citations

Abstract

With the rise of the social web, concern has grown about the quality of user-generated content on social media sites (SMSs). Deceptive comments harm users’ trust in online social media and cause financial loss to firms. Previous studies have used various features and classification algorithms to detect and filter social spam on several social media platforms. However, to the best of our knowledge, previous studies have not exploited probabilistic topic modeling together with incremental learning to detect social spam on SMSs. Thus, the main contribution of this paper is the design of a novel detection methodology that combines topic- and user-based features to improve the effectiveness of social spam detection. The proposed methodology exploits a probabilistic generative model, namely labeled latent Dirichlet allocation (L-LDA), to mine the latent semantics of user-generated comments, and an incremental learning approach to handle the changing feature space. An experiment based on a large dataset extracted from YouTube demonstrates the effectiveness of the proposed methodology, which achieves an average accuracy of 91.17% in social spam detection. Our statistical analysis reveals that topic-based features significantly improve social spam detection, which has significant implications for business practice.
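
The abstract names an incremental learning approach for a changing feature space but this preview gives no implementation details. As a rough illustration only, the sketch below assumes a plain online perceptron with sparse dictionary weights, which is a stand-in and not the authors' classifier. The feature names (`topic:...`, `user:...`), their values, and the two batches are hypothetical; in the paper's setting the topic features would come from L-LDA rather than being written by hand.

```python
from collections import defaultdict


class SparseOnlinePerceptron:
    """Online perceptron whose weight vector grows as new features appear."""

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # features never updated have weight 0.0
        self.bias = 0.0

    def score(self, features):
        # .get() avoids creating weight entries for features seen only at prediction time
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )

    def predict(self, features):
        # +1 means spam, -1 means legitimate
        return 1 if self.score(features) > 0 else -1

    def partial_fit(self, features, label):
        """Update on a single example; weights change only after a mistake."""
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] += self.learning_rate * label * value
            self.bias += self.learning_rate * label


# Hypothetical feature dictionaries (assumed for illustration, not from the paper).
batch_1 = [
    ({"topic:promotion": 0.9, "user:links_per_comment": 1.0}, 1),
    ({"topic:music": 0.8, "user:links_per_comment": 0.0}, -1),
]
# A later batch introduces features the model has never seen.
batch_2 = [
    ({"topic:giveaway": 0.7}, 1),
    ({"topic:music": 0.6, "topic:concert": 0.4}, -1),
]

model = SparseOnlinePerceptron()
feature_counts = []
for batch in (batch_1, batch_2):
    for features, label in batch:
        model.partial_fit(features, label)
    feature_counts.append(len(model.weights))  # size of the learned feature space so far
```

The model is never retrained from scratch: the second batch only adds weights for the features it introduces, which is the property that makes an incremental learner attractive when the vocabulary of spam keeps shifting.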

Acknowledgments

This work was supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Projects: CityU 11502115), and the Shenzhen Municipal Science and Technology R&D Funding - Basic Research Program (Project No. JCYJ20140419115614350).

Author information

Authors and Affiliations

Department of Information Systems, College of Business, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, People’s Republic of China
Long Song, Raymond Yiu Keung Lau & Ron Chi-Wai Kwok
School of Information Management, Victoria Business School, Victoria University of Wellington, 23 Lambton Quay, Wellington, New Zealand
Kristijan Mirkovski
Department of Marketing, College of Business, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, People’s Republic of China
Wenyu Dou

Corresponding author

Correspondence to Long Song.

About this article

Cite this article

Song, L., Lau, R.Y.K., Kwok, R.CW. et al. Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection. Electron Commer Res 17, 51–81 (2017). https://doi.org/10.1007/s10660-016-9244-5

Issue Date: March 2017