Structure in the Enron Email Dataset

Computational & Mathematical Organization Theory

Abstract

We investigate the structure of the Enron email dataset using singular value decomposition and semidiscrete decomposition. Using word frequency profiles, we show that messages fall into two distinct groups, whose extremes are characterized by short messages with rare words versus long messages with common words. It is surprising that message length and word-use pattern should be related in this way. We also investigate relationships among individuals based on their patterns of word use in email, and show that word use is correlated with function within the organization, as expected. Lastly, we show that relative changes in individuals' word usage over time can be used to identify key players in major company events.
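
As a rough illustration of applying singular value decomposition to word frequency profiles, the sketch below decomposes a tiny message-by-word count matrix and projects each message onto the leading singular directions. It is a minimal sketch under stated assumptions, not the authors' pipeline: the counts are invented toy data, the function name `svd_embed` is hypothetical, and centring each column before decomposition is an assumed preprocessing choice. Semidiscrete decomposition is not shown.

```python
import numpy as np


def svd_embed(counts, k=2):
    """Project the rows of a message-by-word count matrix onto the
    top-k singular directions (hypothetical helper, for illustration)."""
    a = np.asarray(counts, dtype=float)
    # Assumption: centre each word column so the decomposition captures
    # variation between messages rather than the average profile.
    a = a - a.mean(axis=0)
    # Thin SVD: a = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Row coordinates in the reduced space are the columns of u scaled by s.
    return u[:, :k] * s[:k], s


# Toy data (invented): rows are messages, columns are words.
# Messages 0 and 1 mostly use words 0 and 2; messages 2 and 3 use words 1 and 3.
counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 2, 0],
    [0, 4, 0, 1],
    [0, 3, 0, 2],
])

coords, singular_values = svd_embed(counts, k=2)
print(coords)
print(singular_values)
```

In this toy case the first coordinate separates the two groups of messages by sign. Which group is positive is arbitrary, because singular vectors are only defined up to sign.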

Author information

Corresponding author

Correspondence to P. S. Keila.

Additional information

P.S. Keila is a graduate student in the School of Computing at Queen's University. His research area is data mining in text.

D.B. Skillicorn is a professor in the School of Computing at Queen's University, where he heads the Smart Information Management Laboratory. His research area is data mining using matrix decompositions, particularly applied to complex datasets in areas such as biomedicine, geochemistry, counterterrorism and fraud.

About this article

Cite this article

Keila, P.S., Skillicorn, D.B. Structure in the Enron Email Dataset. Comput Math Organiz Theor 11, 183–199 (2005). https://doi.org/10.1007/s10588-005-5379-y

  • DOI: https://doi.org/10.1007/s10588-005-5379-y
