A text mining approach to Internet abuse detection

  • Original Article
  • Information Systems and e-Business Management

Abstract

As Internet use in organizations continues to grow, so does Internet abuse in the workplace. Employee Internet abuse (such as online chatting, gaming, investing, shopping, illegal downloading, pornography, and cybersex) and online crime impose severe costs on organizations in the form of lost productivity, wasted resources, security risks, and legal liabilities. Organizations have started to fight back through Internet usage policies, management training, and monitoring, and they are increasingly adopting Internet filtering software. These products rely mainly on blacklists, whitelists, and keyword/profile matching. In this paper, we propose a text mining approach to Internet abuse detection. We empirically compare a variety of term weighting, feature selection, and classification techniques for detecting Internet abuse in a workplace of software programmers. The experimental results are very promising; they demonstrate that the proposed approach would effectively complement existing Internet filtering techniques.
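
To make the three stages named in the abstract concrete, the following is a minimal sketch of one possible pipeline: chi-square feature selection, TF-IDF term weighting, and a nearest-centroid cosine classifier. The abstract does not state which specific techniques were compared, so every choice here is an illustrative assumption, not the authors' method. The stop-word list, the smoothed IDF formula, the toy documents, and the function names (`chi_square_select`, `fit_idf`, `train`, `predict`) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOPWORDS = {"the", "and", "in", "a", "of"}

def tokenize(text):
    # Lowercase, keep purely alphabetic tokens, drop stop words.
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]

def chi_square_select(token_lists, labels, k, positive="abuse"):
    # Feature selection: score each term with the 2x2 chi-square statistic
    # against the positive class and keep the k highest-scoring terms.
    n = len(token_lists)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    vocab = {t for toks in token_lists for t in toks}
    scores = {}
    for term in vocab:
        a = sum(1 for toks, y in zip(token_lists, labels) if term in toks and y == positive)
        b = sum(1 for toks, y in zip(token_lists, labels) if term in toks and y != positive)
        c, d = n_pos - a, n_neg - b
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda t: (-scores[t], t))[:k])

def fit_idf(token_lists, features):
    # Term weighting: smoothed inverse document frequency (an assumed variant).
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks) if t in features)
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in features}

def tfidf(tokens, idf):
    # L2-normalized TF-IDF vector restricted to the selected features.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def train(docs, labels, k=6):
    token_lists = [tokenize(d) for d in docs]
    features = chi_square_select(token_lists, labels, k)
    idf = fit_idf(token_lists, features)
    centroids = {}
    for toks, y in zip(token_lists, labels):
        centroid = centroids.setdefault(y, Counter())
        for t, w in tfidf(toks, idf).items():
            centroid[t] += w
    return idf, centroids

def predict(text, model):
    # Classification: pick the class whose centroid is closest by cosine
    # similarity. Returns None when the text shares no selected feature.
    idf, centroids = model
    vec = tfidf(tokenize(text), idf)
    if not vec:
        return None

    def cosine(centroid):
        dot = sum(w * centroid.get(t, 0.0) for t, w in vec.items())
        norm = math.sqrt(sum(w * w for w in centroid.values()))
        return dot / norm if norm else 0.0

    return max(sorted(centroids), key=lambda y: cosine(centroids[y]))

# Toy, made-up training data; not drawn from the paper's experiment.
docs = [
    "play poker and casino games online",
    "online shopping deals and casino bonus",
    "free games download and shopping coupons",
    "java compiler error in the build",
    "debug the python function and unit test",
    "merge the code branch and fix the build",
]
labels = ["abuse", "abuse", "abuse", "work", "work", "work"]
model = train(docs, labels)

print(predict("casino games tonight", model))
print(predict("fix the build error", model))
```

Returning `None` for text with no selected features is a deliberate design choice in this sketch: a page the model knows nothing about is left to other mechanisms, such as the blacklists and whitelists the abstract mentions, rather than being assigned an arbitrary class.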

Author information

Corresponding author

Correspondence to Chen-Huei Chou.

Additional information

An earlier version of this paper appeared in the Proceedings of the Fifth Workshop on e-Business (WeB), Milwaukee, WI, 2006.

Appendix

Table 5. A list of online news Web sites with subsections used in the experiment

Cite this article

Chou, CH., Sinha, A.P. & Zhao, H. A text mining approach to Internet abuse detection. Inf Syst E-Bus Manage 6, 419–439 (2008). https://doi.org/10.1007/s10257-007-0070-0
