ABSTRACT
Anti-virus systems developed by different vendors often demonstrate strong discrepancies in how they name malware, which significantly hinders malware information sharing. While existing work has proposed a plethora of malware naming standards, most anti-virus vendors have been reluctant to change their own naming conventions. In this paper, we explore a new, more pragmatic alternative. We propose to exploit the correlation between the malware naming of different anti-virus systems to create a consensus classification, through which these systems can share malware information without modifying their naming conventions. Specifically, we present Latin, a novel classification integration framework that leverages the correspondence between participating anti-virus systems as reflected in heterogeneous information sources at the instance-instance, instance-name, and name-name levels. We provide results from extensive experimental studies using real malware datasets and concrete use cases to verify the efficacy of Latin in supporting cross-system malware information sharing.
Index Terms
- Rebuilding the Tower of Babel: Towards Cross-System Malware Information Sharing