Mining categories for emails via clustering and pattern discovery

Journal of Intelligent Information Systems

Abstract

The continuous exchange of information through email has raised the problem of managing, effectively and efficiently, the huge volume of messages that users receive. We address email classification by devising strategies for: (1) organizing messages into homogeneous groups, (2) redirecting further incoming messages according to an initial organization, and (3) building reliable descriptions of the message groups discovered. We propose a unified framework for handling and classifying email messages. In our framework, messages sharing similar features are clustered in a folder organization. Clustering and pattern discovery techniques for mining structured and unstructured information from email messages are the basis of an overall process of folder creation/maintenance and email redirection. Pattern discovery is also exploited for generating suitable cluster descriptions that play a leading role in cluster updating. Experimental evaluation performed on several personal mailboxes shows the effectiveness of our approach.
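The three steps named in the abstract (grouping, redirection, description) can be illustrated with a minimal sketch. This is not the authors' algorithm, which is not given in this preview: it substitutes a greedy one-pass clustering over bag-of-words vectors with cosine similarity, and it stands in for pattern discovery with a simple frequent-term filter. All function names, the `threshold` and `min_support` parameters, and the sample messages are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one message (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(folder):
    """Sum of member vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for vec in folder:
        total.update(vec)
    return total

def cluster(messages, threshold=0.3):
    """Step 1: join the most similar folder, or open a new one (greedy, one pass)."""
    folders = []  # each folder is a list of term-count vectors
    for msg in messages:
        vec = vectorize(msg)
        best, best_sim = None, threshold
        for folder in folders:
            sim = cosine(vec, centroid(folder))
            if sim >= best_sim:
                best, best_sim = folder, sim
        if best is None:
            folders.append([vec])
        else:
            best.append(vec)
    return folders

def redirect(message, folders):
    """Step 2: index of the folder whose centroid best matches a new message."""
    vec = vectorize(message)
    sims = [cosine(vec, centroid(f)) for f in folders]
    return max(range(len(folders)), key=sims.__getitem__)

def describe(folder, min_support=1.0):
    """Step 3: terms present in at least min_support of the folder's messages."""
    doc_freq = Counter(t for vec in folder for t in vec)
    return sorted(t for t, c in doc_freq.items() if c / len(folder) >= min_support)

# Hypothetical sample mailbox
messages = [
    "project meeting agenda monday",
    "project meeting minutes attached",
    "cheap flights holiday deals",
    "holiday flights booking confirmed",
]
folders = cluster(messages)
```

With this sample mailbox, `redirect("agenda for the next project meeting", folders)` selects the first folder, and `describe(folders[0])` returns the terms shared by every message in it. A real system would also use structured fields such as sender and date, which this sketch ignores.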

Abbreviations

H.2.8 (Database Management): Database Applications–Data Mining

I.5.3 (Pattern Recognition): Clustering–Algorithms, Similarity measures

I.5.4 (Pattern Recognition): Applications–Text processing

H.4.3 (Information Systems Applications): Communications Applications–electronic mail

Author information

Corresponding author

Correspondence to Andrea Tagarelli.

About this article

Cite this article

Manco, G., Masciari, E. & Tagarelli, A. Mining categories for emails via clustering and pattern discovery. J Intell Inf Syst 30, 153–181 (2008). https://doi.org/10.1007/s10844-006-0024-x

  • DOI: https://doi.org/10.1007/s10844-006-0024-x
