ABSTRACT
Cybercriminals increasingly use Internet messaging to exploit their victims. We develop and apply a text-based deception detection approach to build hybrid models that detect cybercrime in the text-based Internet communications of native and non-native English-speaking cybercriminal networks. The models combine computational linguistics (CL) and psycholinguistic (PL) features. We study four types of deception-based cybercrime: fraud, scam, favorable fake reviews, and unfavorable fake reviews. For both native and non-native English-speaking cybercriminal networks, we build two types of generalized hybrid models, 2-dataset and 3-dataset, using Naïve Bayes, Support Vector Machine, and k-Nearest Neighbor algorithms. Each 2-dataset model is trained on two forms of cybercrime from different web genres and then used to detect and analyze other types of cybercrime in web genres that were not part of the training set, which establishes model generalizability. Similarly, each 3-dataset model is trained on three forms of cybercrime from different web genres and then used to detect and analyze cybercrime in a web genre that was not part of the training set. Model accuracy on the test datasets ranges from 60% to 80%, with the best performance on unfavorable reviews and fraud, and detection differs notably between messages from native and non-native English-speaking groups. Our work may be applied in provider- or user-based filtering tools that identify cybercriminal actors and block or label undesirable messages before they reach their intended targets.
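The train/test arrangement described in the abstract can be illustrated with a minimal sketch. It only enumerates which cybercrime datasets would be used for training and which would be held out; it does not train any classifier. The dataset labels and the `hybrid_splits` helper are hypothetical names introduced here, and whether the authors evaluated every possible combination is an assumption.

```python
from itertools import combinations

# The four cybercrime types named in the abstract.
# The short labels are hypothetical, chosen for this sketch only.
DATASETS = ["fraud", "scam", "favorable_reviews", "unfavorable_reviews"]


def hybrid_splits(datasets, n_train):
    """Yield (train, test) pairs: train on n_train datasets, hold out the rest."""
    for train in combinations(datasets, n_train):
        # Everything not used for training becomes the unseen test genre(s).
        test = tuple(d for d in datasets if d not in train)
        yield train, test


# 2-dataset models: train on two cybercrime types, test on the other two.
two_dataset = list(hybrid_splits(DATASETS, 2))

# 3-dataset models: train on three cybercrime types, test on the remaining one.
three_dataset = list(hybrid_splits(DATASETS, 3))

for train, test in three_dataset:
    print("train:", train, "-> test:", test)
```

Keeping the held-out datasets disjoint from the training datasets is what lets test accuracy speak to generalization across web genres rather than to fit on a genre the model has already seen.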
Index Terms
- Hybrid Text-based Deception Models for Native and Non-Native English Cybercriminal Networks