Hybrid Text-based Deception Models for Native and Non-Native English Cybercriminal Networks

Authors:
Alex V. Mbaziira, George Mason University, Fairfax, VA
James H. Jones, George Mason University, Fairfax, VA

ICCDA '17: Proceedings of the International Conference on Compute and Data Analysis, May 2017, Pages 23–27. https://doi.org/10.1145/3093241.3093280

Published: 19 May 2017

ABSTRACT

Cybercriminals are increasingly using Internet messaging to exploit their victims. We develop and apply a text-based deception detection approach to build hybrid models for detecting cybercrime in the text-based Internet communications of native and non-native English-speaking cybercriminal networks. The models are hybrid in that they use both computational linguistics (CL) and psycholinguistic (PL) features. We study four types of deception-based cybercrime: fraud, scam, favorable fake reviews, and unfavorable fake reviews. For both native and non-native English-speaking cybercriminal networks, we build two types of generalized hybrid models, 2-dataset and 3-dataset models, using the Naïve Bayes, Support Vector Machine, and k-Nearest Neighbor algorithms. Each 2-dataset model is trained on any two forms of cybercrime from different web genres and is then used to detect and analyze other types of cybercrime in web genres that were not part of the training set, which establishes model generalizability. Similarly, each 3-dataset model is trained on any three forms of cybercrime from different web genres and is then used to detect and analyze cybercrime in a web genre that was not part of the training set. Model accuracy on the test datasets ranges from 60% to 80%, with the best performance on the detection of unfavorable fake reviews and fraud. Notable differences emerged between detection in messages from native and non-native English-speaking groups. Our work may be applied in provider- or user-based filtering tools to identify cybercriminal actors and to block or label undesirable messages before they reach their intended targets.
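The cross-genre evaluation design described in the abstract can be illustrated with a minimal sketch that enumerates which datasets a model is trained on and which are held out for testing. The genre labels and the `hybrid_splits` helper are hypothetical names introduced here for illustration; the abstract says only that models are trained on "any two" or "any three" forms of cybercrime, so enumerating every combination is an assumption rather than the authors' documented procedure. The sketch also leaves out feature extraction, the classifiers, and the separate native and non-native groups.

```python
from itertools import combinations

# The four cybercrime types named in the abstract.
# The short keys are illustrative labels, not names used by the authors.
GENRES = ("fraud", "scam", "favorable_reviews", "unfavorable_reviews")


def hybrid_splits(genres, n_train):
    """Yield (train_genres, test_genres) pairs for n_train-dataset models.

    Each model is trained on n_train genres and tested only on the
    genres that were not part of its training set.
    """
    for train in combinations(genres, n_train):
        test = tuple(g for g in genres if g not in train)
        yield train, test


# Assumption: every possible combination is used.
two_dataset = list(hybrid_splits(GENRES, 2))    # two held-out genres per model
three_dataset = list(hybrid_splits(GENRES, 3))  # one held-out genre per model

for train, test in two_dataset + three_dataset:
    print(f"train on {' + '.join(train)} -> test on {', '.join(test)}")
```

Under this assumption there are six possible 2-dataset training combinations and four possible 3-dataset combinations per speaker group, and no test genre ever appears in the corresponding training set.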

Index Terms

Hybrid Text-based Deception Models for Native and Non-Native English Cybercriminal Networks
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
2. Social and professional topics
  1. Computing / technology policy
    1. Computer crime

Published in

ICCDA '17: Proceedings of the International Conference on Compute and Data Analysis
May 2017
307 pages
ISBN:9781450352413
DOI:10.1145/3093241

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cybercrime
computational linguistics
deception
machine learning
natural language processing
psycholinguistics
Qualifiers
- research-article
- Research
- Refereed limited