ABSTRACT
In recent years, spam email has become a major tool for criminals conducting illegal business on the Internet. This paper describes a research approach that applies data mining techniques to spam email, with a focus on forensic analysis for law enforcement. After extracting useful attributes from spam emails, we use a connected components clustering algorithm to form relationships between messages. These initial clusters are then refined with a weighted-edge model in which membership in a cluster requires the edge weight to exceed a chosen threshold. Cluster membership is validated against WHOIS data, against the IP addresses of the computers hosting the advertised sites, and by comparing graphical images of the fetched websites. The technique has identified relationships between spam campaigns that human researchers had not found, enabling additional data to be brought into a single investigation.
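The two clustering stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names (`subject`, `domain`), the per-attribute weights, the threshold value, and the sample messages are all hypothetical, and the abstract does not say how edge weights are actually computed. Here an edge's weight is assumed to be the sum of the weights of the attributes on which two messages agree.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def pair_weights(emails, attr_weights):
    """Assumed weighting: sum the weights of attributes two emails share."""
    weights = {}
    for a, b in combinations(sorted(emails), 2):
        w = sum(
            weight
            for attr, weight in attr_weights.items()
            if emails[a].get(attr) is not None
            and emails[a].get(attr) == emails[b].get(attr)
        )
        if w > 0:
            weights[(a, b)] = w
    return weights


def cluster(emails, attr_weights, threshold):
    """Return (initial clusters, refined clusters)."""
    weights = pair_weights(emails, attr_weights)
    # Stage 1: any shared attribute links two messages.
    initial = connected_components(emails, weights)
    # Stage 2: keep only edges whose weight exceeds the threshold.
    strong_edges = [pair for pair, w in weights.items() if w > threshold]
    refined = connected_components(emails, strong_edges)
    return initial, refined


# Hypothetical sample data, for illustration only.
emails = {
    "m1": {"subject": "cheap pills", "domain": "pills-a.example"},
    "m2": {"subject": "cheap pills", "domain": "pills-a.example"},
    "m3": {"subject": "cheap pills", "domain": "watches.example"},
    "m4": {"subject": "luxury watches", "domain": "watches.example"},
}
attr_weights = {"subject": 1.0, "domain": 2.0}

initial, refined = cluster(emails, attr_weights, threshold=1.5)
print(initial)
print(refined)
```

In this toy data a single shared subject line is enough to chain all four messages into one initial cluster, while the threshold splits that cluster into two groups that are each linked by a shared advertised domain.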
Index Terms
- Mining spam email to identify common origins for forensic application