Email mining: tasks, common techniques, and tools

  • Survey Paper
Knowledge and Information Systems

Abstract

Email is one of the most popular forms of communication today, mainly because it is efficient, inexpensive, and able to carry diverse types of information. To facilitate better use of email and to explore its business potential, various data mining techniques have been applied to email data. In this paper, we present a brief survey of the major research efforts on email mining. To emphasize the differences between email mining and general text mining, we organize our survey around five major email mining tasks: spam detection, email categorization, contact analysis, email network property analysis, and email visualization. These tasks are inherent in the many ways email is used. We systematically review the commonly used techniques and also discuss the available software tools.

Acknowledgments

We are grateful to the anonymous reviewers for their very useful comments and suggestions. This research is supported in part by an NSERC Discovery Grant, a BCFRST NRAS Endowment Research Team Program project and a GRAND NCE project. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to Jian Pei.

About this article

Cite this article

Tang, G., Pei, J. & Luk, WS. Email mining: tasks, common techniques, and tools. Knowl Inf Syst 41, 1–31 (2014). https://doi.org/10.1007/s10115-013-0658-2

  • DOI: https://doi.org/10.1007/s10115-013-0658-2
