DOI: 10.1145/1363686.1364019
research-article

Mining spam email to identify common origins for forensic application

Published: 16 March 2008

ABSTRACT

In recent years, spam email has become a major tool for criminals conducting illegal business on the Internet. In this paper we describe a new research approach that uses data mining techniques to study spam emails, with a focus on law enforcement forensic analysis. After retrieving useful attributes from spam emails, we use a connected components clustering algorithm to form relationships between messages. These initial clusters are then refined using a weighted edges model in which membership in a cluster requires the edge weight to exceed a chosen threshold. Cluster membership is validated by WHOIS data, by the IP address of the computer hosting the advertised sites, and through comparison of graphical images of website fetches. This technique has been successful in identifying relationships between spam campaigns that were not identified by human researchers, enabling additional data to be brought into a single investigation.
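
The two clustering stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names (subject, domain, host_ip), the weights, the threshold value, and the toy messages are all assumptions made for illustration, and refinement is read here as "drop edges whose weight does not exceed the threshold, then recompute the connected components", which is only one possible interpretation of the weighted edges model.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy messages; the attribute names and values are illustrative only.
MESSAGES = {
    "m1": {"subject": "cheap meds", "domain": "pills-a.example", "host_ip": "192.0.2.10"},
    "m2": {"subject": "best pharmacy", "domain": "pills-a.example", "host_ip": "192.0.2.10"},
    "m3": {"subject": "cheap meds", "domain": "watches-b.example", "host_ip": "198.51.100.7"},
    "m4": {"subject": "luxury watches", "domain": "watches-b.example", "host_ip": "198.51.100.7"},
    "m5": {"subject": "win a prize", "domain": "lotto-c.example", "host_ip": "203.0.113.5"},
}

# Assumed per-attribute weights; the abstract does not state the real values.
WEIGHTS = {"subject": 1.0, "domain": 3.0, "host_ip": 2.0}


def edge_weight(a, b):
    """Sum the weights of the attributes on which two messages agree."""
    return sum(w for attr, w in WEIGHTS.items() if a[attr] == b[attr])


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from start.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return sorted(components)


def cluster(messages, threshold):
    """Link message pairs whose edge weight exceeds threshold, then group them."""
    nodes = list(messages)
    edges = [
        (u, v)
        for u, v in combinations(nodes, 2)
        if edge_weight(messages[u], messages[v]) > threshold
    ]
    return connected_components(nodes, edges)


if __name__ == "__main__":
    # Stage 1: any shared attribute links two messages (threshold 0).
    print(cluster(MESSAGES, 0.0))
    # Stage 2: only links heavier than the assumed threshold of 2.0 survive.
    print(cluster(MESSAGES, 2.0))
```

With this toy data, the first stage groups m1 through m4 into one cluster because m1 and m3 share a subject line. The second stage splits that group into {m1, m2} and {m3, m4}, since a shared subject alone (weight 1.0) does not exceed the threshold, while a shared domain and hosting IP (weight 5.0) does.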


Published in

SAC '08: Proceedings of the 2008 ACM symposium on Applied computing
March 2008
2586 pages
ISBN: 9781595937537
DOI: 10.1145/1363686

Copyright © 2008 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
