Using Data Mining Methods to Predict Personally Identifiable Information in Emails

Geng, Liqiang; Korba, Larry; Wang, Xin; Wang, Yunli; Liu, Hongyu; You, Yonghua

doi:10.1007/978-3-540-88192-6_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Private information management and compliance are important issues for most organizations. As a major communication tool for organizations, email is one of many potential sources of privacy leaks. Information extraction methods have been applied to detect private information in text files. However, since email messages usually consist of low-quality text, information extraction methods for private information detection may not achieve good performance. In this paper, we address the problem of predicting the presence of private information in email using data mining and text mining methods. Two prediction models are proposed. The first model is based on association rules that predict one type of private information from other types of private information identified in emails. The second model is based on classification models that predict private information according to the content of the emails. Experiments on the Enron email dataset show promising results.
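
To make the first model more concrete, the following is a minimal sketch of association-rule mining over the types of private information found in each email. It is not the authors' implementation: the toy data, the type labels (such as "phone" and "credit_card"), the thresholds, and the function names `support` and `mine_rules` are all assumptions made for illustration, and the brute-force enumeration stands in for whatever mining algorithm the paper actually uses.

```python
from itertools import combinations

# Hypothetical toy data: each set lists the PII types detected in one email.
# Labels and counts are invented for illustration, not taken from the paper.
emails = [
    {"name", "phone", "address"},
    {"name", "phone"},
    {"name", "email_address"},
    {"name", "phone", "address"},
    {"credit_card", "name", "address"},
    {"phone"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Enumerate rules 'antecedent -> consequent' with 1 or 2 antecedent items.

    Returns a list of (antecedent, consequent, support, confidence) tuples.
    Brute force is fine here because the number of PII types is small.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for size in (1, 2):
        for antecedent in combinations(items, size):
            ante_support = support(antecedent, transactions)
            if ante_support == 0:
                continue  # antecedent never occurs, so confidence is undefined
            for consequent in items:
                if consequent in antecedent:
                    continue
                both = support(antecedent + (consequent,), transactions)
                confidence = both / ante_support
                if both >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, consequent, both, confidence))
    return rules


if __name__ == "__main__":
    for antecedent, consequent, sup, conf in mine_rules(emails):
        print(f"{' & '.join(antecedent)} -> {consequent} "
              f"(support={sup:.2f}, confidence={conf:.2f})")
```

A rule such as "address -> name" with high confidence would suggest that an email in which an address has been detected is likely to contain a name as well, even if the extractor missed it. That is the kind of prediction the abstract describes for the first model.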

Author information

Authors and Affiliations

Institute of Information Technology, National Research Council of Canada, Fredericton, New Brunswick, Canada
Liqiang Geng, Larry Korba, Yunli Wang, Hongyu Liu & Yonghua You
Department of Geomatics Engineering, University of Calgary, Calgary, Alberta, Canada
Xin Wang

Editor information

Editors and Affiliations

School of Computer Science, Sichuan University, 610065, Chengdu, China
Changjie Tang
Department of Computer Science, The University of Western Ontario, Canada
Charles X. Ling
School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Faculty of Science & Engineering, York University, 355 Lumbers Building, M3J 1P3, Toronto, Ontario, Canada
Nick J. Cercone
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, 4072, Queensland, Australia
Xue Li

Cite this paper

Geng, L., Korba, L., Wang, X., Wang, Y., Liu, H., You, Y. (2008). Using Data Mining Methods to Predict Personally Identifiable Information in Emails. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_26

DOI: https://doi.org/10.1007/978-3-540-88192-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science, Computer Science (R0)
