Abstract
Email is one of the most popular forms of communication today, mainly owing to its efficiency, low cost, and compatibility with diverse types of information. To facilitate better use of email and to explore its business potential, various data mining techniques have been applied to email data. In this paper, we present a brief survey of the major research efforts on email mining. To emphasize the differences between email mining and general text mining, we organize our survey around five major email mining tasks: spam detection, email categorization, contact analysis, email network property analysis, and email visualization. These tasks are inherently incorporated into various uses of email. We systematically review the commonly used techniques and discuss the related software tools that are available.
Acknowledgments
We are grateful to the anonymous reviewers for their very useful comments and suggestions. This research is supported in part by an NSERC Discovery Grant, a BCFRST NRAS Endowment Research Team Program project and a GRAND NCE project. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Cite this article
Tang, G., Pei, J. & Luk, WS. Email mining: tasks, common techniques, and tools. Knowl Inf Syst 41, 1–31 (2014). https://doi.org/10.1007/s10115-013-0658-2
DOI: https://doi.org/10.1007/s10115-013-0658-2