Mining categories for emails via clustering and pattern discovery

Manco, Giuseppe; Masciari, Elio; Tagarelli, Andrea

doi:10.1007/s10844-006-0024-x

Published: 25 January 2007

Volume 30, pages 153–181 (2008)

Journal of Intelligent Information Systems

Giuseppe Manco¹,
Elio Masciari¹ &
Andrea Tagarelli²

Abstract

The continuous exchange of information by means of the popular email service has raised the problem of managing the huge amounts of messages received from users in an effective and efficient way. We deal with the problem of email classification by conceiving suitable strategies for: (1) organizing messages into homogeneous groups, (2) redirecting further incoming messages according to an initial organization, and (3) building reliable descriptions of the message groups discovered. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered in a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages are the basis of an overall process of folder creation/maintenance and email redirection. Pattern discovery is also exploited for generating suitable cluster descriptions that play a leading role in cluster updating. Experimental evaluation performed on several personal mailboxes shows the effectiveness of our approach.
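
The three steps listed in the abstract (grouping messages, redirecting new ones, describing each group) can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify: it assumes a greedy single-pass clustering over bag-of-words vectors with cosine similarity, and it uses the most frequent terms of a folder as a stand-in for discovered patterns. All function names and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a message into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(folder):
    """Sum of the vectors in a folder (cosine ignores the scale)."""
    total = Counter()
    for vec in folder:
        total.update(vec)
    return total


def cluster(messages, threshold=0.3):
    """Step 1: put each message in the most similar folder, or open a new one."""
    folders, labels = [], []
    for msg in messages:
        vec = vectorize(msg)
        scores = [cosine(vec, centroid(f)) for f in folders]
        best = max(range(len(folders)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            folders[best].append(vec)
            labels.append(best)
        else:
            folders.append([vec])
            labels.append(len(folders) - 1)
    return folders, labels


def redirect(message, folders):
    """Step 2: send an incoming message to the closest existing folder."""
    vec = vectorize(message)
    scores = [cosine(vec, centroid(f)) for f in folders]
    return max(range(len(folders)), key=scores.__getitem__)


def describe(folder, k=3):
    """Step 3: describe a folder by its k most frequent terms."""
    return [term for term, _ in centroid(folder).most_common(k)]
```

A call such as `folders, labels = cluster(list_of_message_bodies)` followed by `redirect(new_body, folders)` shows the intended flow. A real system would also use the structured header fields the abstract mentions (sender, subject, and so on), which this sketch ignores.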

Abbreviations

H.2.8 (Database Management): Database Applications – Data Mining
I.5.3 (Pattern Recognition): Clustering – Algorithms, Similarity measures
I.5.4 (Pattern Recognition): Applications – Text processing
H.4.3 (Information Systems Applications): Communications Applications – Electronic mail

Author information

Authors and Affiliations

ICAR-CNR, 87036, Rende (CS), Italy
Giuseppe Manco & Elio Masciari
DEIS, University of Calabria, 87036, Rende (CS), Italy
Andrea Tagarelli

Corresponding author

Correspondence to Andrea Tagarelli.

About this article

Cite this article

Manco, G., Masciari, E. & Tagarelli, A. Mining categories for emails via clustering and pattern discovery. J Intell Inf Syst 30, 153–181 (2008). https://doi.org/10.1007/s10844-006-0024-x

Received: 20 February 2004
Revised: 30 November 2005
Accepted: 02 June 2006
Published: 25 January 2007
Issue Date: April 2008
DOI: https://doi.org/10.1007/s10844-006-0024-x
