DOI: 10.1145/3093241.3093280
research-article

Hybrid Text-based Deception Models for Native and Non-Native English Cybercriminal Networks

Published: 19 May 2017

ABSTRACT

Cybercriminals increasingly use Internet messaging to exploit their victims. We develop and apply a text-based deception detection approach to build hybrid models for detecting cybercrime in text-based Internet communications from native and non-native English speaking cybercriminal networks, where our models use both computational linguistics (CL) and psycholinguistic (PL) features. We study four types of deception-based cybercrime: fraud, scams, favorable fake reviews, and unfavorable fake reviews. We build two types of generalized hybrid models for both native and non-native English speaking cybercriminal networks, 2-dataset and 3-dataset hybrid models, using the Naïve Bayes, Support Vector Machine, and k-Nearest Neighbor algorithms. Each 2-dataset model is trained on two forms of cybercrime from different web genres and then used to detect and analyze other types of cybercrime in web genres that were not part of the training set, establishing model generalizability. Similarly, each 3-dataset model is trained on three forms of cybercrime from different web genres and then used to detect and analyze cybercrime in a web genre that was not part of the training set. Model performance on the test datasets ranges from 60% to 80% accuracy, with the best performance on detection of unfavorable reviews and fraud, and notable differences between detection in messages from native and non-native English speaking groups. Our work may be applied in provider- or user-based filtering tools that identify cybercriminal actors and block or label undesirable messages before they reach their intended targets.
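As a rough illustration of the cross-genre training and evaluation scheme described above, the sketch below trains Naïve Bayes, SVM, and k-Nearest Neighbor classifiers on deceptive versus truthful texts drawn from two web genres and scores them on a held-out genre. It is not the authors' pipeline: the six-sentence toy corpus, the genre labels, and the use of TF-IDF word/bigram features as a stand-in for the paper's combined computational-linguistic and psycholinguistic (LIWC-style) feature set are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of a 2-dataset, cross-genre setup:
# train on two cybercrime genres, test on a held-out genre.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Hypothetical labeled samples: (text, label, genre); label 1 = deceptive.
corpus = [
    ("Dear friend, I urgently need your bank details to transfer the funds", 1, "scam"),
    ("Please find attached the quarterly report for your review", 0, "scam"),
    ("This hotel was absolutely perfect, best stay of my life, amazing!", 1, "review"),
    ("The room was clean and the staff polite, though breakfast was slow", 0, "review"),
    ("We guarantee 500% returns, wire the money today before the offer ends", 1, "fraud"),
    ("The invoice for last month's services is enclosed as agreed", 0, "fraud"),
]

# 2-dataset setup: train on two genres (scam + review), test on a third (fraud).
train = [(text, label) for text, label, genre in corpus if genre in {"scam", "review"}]
test = [(text, label) for text, label, genre in corpus if genre == "fraud"]
X_train, y_train = zip(*train)
X_test, y_test = zip(*test)

# The three classifier families named in the abstract.
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "kNN": KNeighborsClassifier(n_neighbors=1),
}

for name, clf in classifiers.items():
    # TF-IDF unigrams/bigrams stand in for the hybrid CL + PL feature set.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: held-out-genre accuracy = {acc:.2f}")
```

In the paper's terms this corresponds to a single 2-dataset model; a 3-dataset model would simply widen the training filter to three genres before testing on the remaining one.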


Published in

ICCDA '17: Proceedings of the International Conference on Compute and Data Analysis
May 2017, 307 pages
ISBN: 9781450352413
DOI: 10.1145/3093241

          Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

