Abstract:
In recent years, researchers have had great success in automatically classifying and detecting malware with machine learning methods. However, most machine-learning-based approaches rely too heavily on the training samples, so a new malware family that does not belong to the training set cannot be identified. To address this issue, we propose the soft relevance value (s-value), a new way of evaluating feature soft relevance that uses the mixed distance criterion to assess classification results. Specifically, we leverage the mixed distance criterion from pattern recognition to recognize testing samples as belonging to a new family that is not labeled in the training set. Finally, we evaluate how the s-value can be used to distinguish and classify a new malware family with the malware datasets from the Research Prediction Competition of Microsoft Malware Classification Challenge and Windows (Kaggle). The experimental results show that the training time is approximately 12 hours, while the prediction time is only about 0.5 seconds. Compared with the Kaggle winner, our training and prediction times are only 16.7% and 3.8% of the winner's, respectively. The malware classification accuracy reaches 99.8%. These results indicate that the proposed s-value balances accuracy, training time, and prediction time, and outperforms state-of-the-art machine-learning-based malware detection approaches. In addition, our method is able to identify new malware families that are not included in the training set.
Published in: IEEE Transactions on Reliability (Volume: 71, Issue: 1, March 2022)