ABSTRACT
This paper examines the problem of inferring the ground-truth malware family from inconsistent antivirus vendor labels. Our key insight is that vendors are not equally reliable, so we construct a two-dimensional probability matrix for each vendor that models its ability to identify each family. We then formalize the inference task as a maximum likelihood estimation problem with hidden random variables and propose a solution based on the expectation-maximization (EM) algorithm.
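To make the idea concrete, the following is a minimal sketch of EM over per-vendor probability matrices, in the style of the classic Dawid-Skene observer-error model. It is an illustration under stated assumptions, not the authors' implementation: the function name `em_family_inference`, the soft majority-vote initialization, the additive smoothing constant, and the fixed iteration count are all hypothetical choices, and vendor labels are assumed to be already normalized to integer family indices (with `None` where a vendor gave no label).

```python
import math

def em_family_inference(labels, n_families, n_iter=50, smooth=0.01):
    """Hypothetical sketch. labels[i][v] is the family index vendor v
    assigned to sample i, or None if the vendor gave no label.
    Returns (posteriors, per-vendor matrices, family prior)."""
    K = n_families
    n_vendors = len(labels[0])

    # Assumed initialization: soft majority vote over the vendor labels.
    post = []
    for row in labels:
        votes = [smooth] * K
        for lab in row:
            if lab is not None:
                votes[lab] += 1.0
        total = sum(votes)
        post.append([v / total for v in votes])

    for _ in range(n_iter):
        # M-step: re-estimate the family prior and, for each vendor, a
        # K x K matrix conf[v][k][l] ~ P(vendor v says l | true family k).
        prior = [smooth] * K
        conf = [[[smooth] * K for _ in range(K)] for _ in range(n_vendors)]
        for row, p in zip(labels, post):
            for k in range(K):
                prior[k] += p[k]
                for v, lab in enumerate(row):
                    if lab is not None:
                        conf[v][k][lab] += p[k]
        z = sum(prior)
        prior = [x / z for x in prior]
        for v in range(n_vendors):
            for k in range(K):
                z = sum(conf[v][k])
                conf[v][k] = [x / z for x in conf[v][k]]

        # E-step: posterior over the hidden true family of each sample,
        # computed in log space and normalized for numerical stability.
        new_post = []
        for row in labels:
            logp = [math.log(prior[k]) for k in range(K)]
            for v, lab in enumerate(row):
                if lab is not None:
                    for k in range(K):
                        logp[k] += math.log(conf[v][k][lab])
            m = max(logp)
            w = [math.exp(x - m) for x in logp]
            z = sum(w)
            new_post.append([x / z for x in w])
        post = new_post

    return post, conf, prior

# Toy data (made up): vendors 0 and 1 are informative, while vendor 2
# reports family 0 for every sample it labels.
toy_labels = [
    [0, 0, 0], [0, 0, 0], [1, 1, 0], [1, 1, 0],
    [0, 0, 0], [1, 1, 0], [0, None, 0], [1, 1, None],
]
posteriors, matrices, prior = em_family_inference(toy_labels, n_families=2)
predicted = [max(range(2), key=p.__getitem__) for p in posteriors]
```

In this toy run, the learned matrix for vendor 2 assigns almost identical rows to both true families, so its labels contribute next to nothing to the posterior. This is the sense in which a per-vendor matrix captures unequal reliability, whereas a plain majority vote would weight all three vendors equally.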
We first evaluate our model on Malgenome, a popular Android dataset with 1,234 samples. The results indicate that our scheme achieves 93.19% precision with 86.82% recall, outperforming related work, especially on some unknown and ambiguous families. We then build a larger dataset of 10,165 samples, randomly selected from 12 Windows and Linux malware families, to verify the robustness of our method. The experimental results show that our solution also achieves high precision and recall on this dataset.
Index Terms
- EM Meets Malicious Data: A Novel Method for Massive Malware Family Inference