Abstract:
Malware has been established as one of the major threats in the cyberspace. Current mitigation efforts are focused in suspicious files disclosure, omitting key aspects in...Show MoreMetadata
Abstract:
Malware has been established as one of the major threats in the cyberspace. Current mitigation efforts are focused in suspicious files disclosure, omitting key aspects in detection, such as category clustering. While state-of-the-art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. The latter is done by identifying malicious system calls (syscalls) from different binaries; then modelled in a textually manner to extract the most relevant features employing a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. Weighted syscalls are subjected to KNN algorithm to find latent malware categories. A case study proves it is possible to cluster at least 60 new malware categories.
Date of Conference: 02-03 May 2019
Date Added to IEEE Xplore: 21 June 2019
ISBN Information: