Abstract
The continuous exchange of information through email has raised the problem of managing, effectively and efficiently, the large volume of messages that users receive. We address email classification by devising strategies for: (1) organizing messages into homogeneous groups, (2) redirecting further incoming messages according to an initial organization, and (3) building reliable descriptions of the discovered message groups. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered into a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages form the basis of an overall process of folder creation/maintenance and email redirection. Pattern discovery is also exploited to generate cluster descriptions that play a leading role in cluster updating. An experimental evaluation on several personal mailboxes shows the effectiveness of our approach.
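The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the features, the similarity measure, or the clustering and pattern discovery methods. The sketch assumes word-set features, Jaccard similarity, a greedy single-pass grouping, and frequent terms as folder descriptions. All function names and the `threshold` and `min_support` parameters are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased word set of a message (an assumed, very crude feature set)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_folder(folders, msg):
    """Index and score of the folder with the highest average similarity to msg."""
    best, best_sim = None, 0.0
    for i, members in enumerate(folders):
        sim = sum(jaccard(msg, m) for m in members) / len(members)
        if sim > best_sim:
            best, best_sim = i, sim
    return best, best_sim


def organize(messages, threshold=0.2):
    """Step 1: group messages; open a new folder when nothing is similar enough."""
    folders = []
    for text in messages:
        msg = tokens(text)
        idx, sim = best_folder(folders, msg)
        if idx is not None and sim >= threshold:
            folders[idx].append(msg)
        else:
            folders.append([msg])
    return folders


def redirect(folders, text):
    """Step 2: folder index for an incoming message, or None if nothing overlaps."""
    idx, _ = best_folder(folders, tokens(text))
    return idx


def describe(members, min_support=0.6):
    """Step 3: terms that occur in at least min_support of a folder's messages."""
    counts = Counter(t for m in members for t in m)
    return sorted(t for t, c in counts.items() if c / len(members) >= min_support)


inbox = [
    "project meeting agenda monday",
    "project meeting notes monday",
    "cheap flights holiday offer",
    "holiday offer cheap hotels",
]
folders = organize(inbox)
print(len(folders))
print(describe(folders[0]))
print(redirect(folders, "agenda for the next project meeting"))
```

In this toy setting the description of a folder is simply its set of frequent terms. A realistic system would also need to handle structured fields such as sender and subject, which the abstract mentions but this sketch ignores.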
Abbreviations
- H.2.8 (Database Management): Database Applications–Data Mining
- I.5.3 (Pattern Recognition): Clustering–Algorithms, Similarity measures
- I.5.4 (Pattern Recognition): Applications–Text processing
- H.4.3 (Information Systems Applications): Communications Applications–electronic mail
Cite this article
Manco, G., Masciari, E. & Tagarelli, A. Mining categories for emails via clustering and pattern discovery. J Intell Inf Syst 30, 153–181 (2008). https://doi.org/10.1007/s10844-006-0024-x
DOI: https://doi.org/10.1007/s10844-006-0024-x