Abstract
As the use of the Internet in organizations continues to grow, so does Internet abuse in the workplace. Employee Internet abuse (such as online chatting, gaming, investing, shopping, illegal downloading, pornography, and cybersex) and online crime are imposing severe costs on organizations in the form of lost productivity, wasted resources, security risks, and legal liabilities. Organizations have started to fight back through Internet usage policies, management training, and monitoring. They are also increasingly adopting Internet filtering software, which relies mainly on blacklists, whitelists, and keyword/profile matching. In this paper, we propose a text mining approach to Internet abuse detection. We empirically compare a variety of term weighting, feature selection, and classification techniques for detecting Internet abuse in the workplace of software programmers. The experimental results are promising and suggest that the proposed approach can effectively complement existing Internet filtering techniques.
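To make the three stages named above concrete, the following is a minimal sketch of a term weighting, feature selection, and classification pipeline. The abstract does not say which specific techniques are compared, so every choice here is an assumption made for illustration only: TF-IDF as the weighting scheme, chi-square as the selection criterion, a nearest-centroid classifier with cosine similarity, the labels "work" and "abuse", and the toy training texts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (page text, label) pairs; not data from the paper.
TRAIN = [
    ("java compiler error stack trace debug", "work"),
    ("python unit test api documentation", "work"),
    ("database query index api debug", "work"),
    ("poker casino bonus jackpot play", "abuse"),
    ("discount shopping cart sale bonus", "abuse"),
    ("casino game play online jackpot", "abuse"),
]

def tokenize(text):
    return text.lower().split()

def chi_square_scores(docs, labels, target):
    """Chi-square statistic of each term against the class `target` vs. the rest."""
    n = len(docs)
    scores = {}
    for term in {t for doc in docs for t in doc}:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != target)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == target)
        d = n - a - b - c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def weigh(tokens, idf):
    """TF-IDF vector restricted to the selected terms (the keys of `idf`)."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train(examples, k=6, target="abuse"):
    docs = [tokenize(text) for text, _ in examples]
    labels = [label for _, label in examples]
    # Feature selection: keep the k terms with the highest chi-square score.
    scores = chi_square_scores(docs, labels, target)
    selected = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    # Term weighting: inverse document frequency over the training set.
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(len(docs) / df[t]) for t in selected}
    # Classification model: one mean TF-IDF vector (centroid) per class.
    sums = defaultdict(lambda: defaultdict(float))
    for doc, y in zip(docs, labels):
        for t, w in weigh(doc, idf).items():
            sums[y][t] += w
    counts = Counter(labels)
    centroids = {y: {t: w / counts[y] for t, w in vec.items()}
                 for y, vec in sums.items()}
    return idf, centroids

def classify(text, idf, centroids):
    """Assign the label whose centroid is most similar to the page's vector."""
    vec = weigh(tokenize(text), idf)
    return max(centroids, key=lambda y: cosine(vec, centroids[y]))

idf, centroids = train(TRAIN)
print(classify("online casino jackpot bonus", idf, centroids))
print(classify("debug the api call", idf, centroids))
```

Unlike a blacklist or a fixed keyword list, a classifier of this kind learns its vocabulary from labeled examples, which is one way such an approach could complement existing filters. A realistic study would use a much larger labeled corpus and held-out evaluation rather than a handful of toy strings.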




Additional information
An earlier version of this paper appeared in the Proceedings of the Fifth Workshop on e-Business (WeB), Milwaukee, WI, 2006.
About this article
Cite this article
Chou, CH., Sinha, A.P. & Zhao, H. A text mining approach to Internet abuse detection. Inf Syst E-Bus Manage 6, 419–439 (2008). https://doi.org/10.1007/s10257-007-0070-0
DOI: https://doi.org/10.1007/s10257-007-0070-0