DOI: 10.1145/3422713.3422743

EM Meets Malicious Data: A Novel Method for Massive Malware Family Inference

Published: 23 October 2020

ABSTRACT

This paper examines the problem of inferring the underlying ground-truth family of a malware sample from inconsistent antivirus vendor labels. Our insight is that vendors are not equally reliable, so we construct a two-dimensional probability matrix for each vendor to model its ability to identify diverse families. We then formalize the inference task as a maximum likelihood estimation problem with hidden random variables and propose a solution based on the expectation-maximization (EM) algorithm.
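
The abstract gives no implementation details, so the following is only a minimal sketch of how EM over per-vendor probability matrices could look. It assumes the two-dimensional matrix of vendor v holds P(vendor v reports family l | true family is t), in the style of the classic observer error-rate model; the function name `em_family_inference`, the `-1` convention for missing labels, and the smoothing constant are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def em_family_inference(labels, n_families, n_iter=50, smoothing=1e-2):
    """Hypothetical helper, not the paper's code.

    labels[i, v] is the family index vendor v assigned to sample i,
    or -1 when the vendor gave no usable label.
    """
    labels = np.asarray(labels)
    n_samples, n_vendors = labels.shape

    # Initialise the posterior over true families with a smoothed vote count.
    post = np.full((n_samples, n_families), smoothing)
    for i in range(n_samples):
        for v in range(n_vendors):
            if labels[i, v] >= 0:
                post[i, labels[i, v]] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: family priors and one row-stochastic matrix per vendor,
        # conf[v, t, l] ~ P(vendor v reports l | true family is t).
        prior = post.mean(axis=0)
        conf = np.full((n_vendors, n_families, n_families), smoothing)
        for i in range(n_samples):
            for v in range(n_vendors):
                if labels[i, v] >= 0:
                    conf[v, :, labels[i, v]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute the posterior of each sample's hidden family,
        # working in log space to avoid underflow with many vendors.
        log_post = np.tile(np.log(prior + 1e-12), (n_samples, 1))
        for i in range(n_samples):
            for v in range(n_vendors):
                if labels[i, v] >= 0:
                    log_post[i] += np.log(conf[v, :, labels[i, v]])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post.argmax(axis=1), conf

# Toy data (made up): vendors 0-2 are accurate, vendor 3 always says family 0.
votes = [
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
    [1, 1, 1, 0], [1, -1, 1, 0], [1, 1, 1, 0],
    [2, 2, 2, 0], [2, 2, 2, 0], [2, 2, -1, 0],
]
inferred, conf = em_family_inference(votes, n_families=3)
```

In this toy run the learned matrix for vendor 3 puts nearly all of its mass in the first column regardless of the true family, so its vote carries almost no information, which is the kind of unequal reliability the abstract describes.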

We first evaluate our model on Malgenome, a popular Android dataset with 1,234 samples. The results indicate that our scheme achieves 93.19% precision with 86.82% recall, outperforming related work, especially on some unknown and ambiguous families. We then build a larger dataset of 10,165 samples randomly selected from 12 Windows and Linux malware families to verify the robustness of our method, and the experimental results show that our solution also obtains high precision and recall on this dataset.
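
The abstract does not say how precision and recall are computed. As a hedged illustration only, one common convention is to compute them per family and average across families (macro-averaging); the paper may use a different definition. The function name and the toy labels below are hypothetical.

```python
from collections import Counter

def macro_precision_recall(true_families, predicted_families):
    """Macro-averaged precision and recall over family labels (illustrative)."""
    hits = Counter(t for t, p in zip(true_families, predicted_families) if t == p)
    predicted_count = Counter(predicted_families)
    true_count = Counter(true_families)

    # Precision is averaged over families that were predicted at least once,
    # recall over families that actually occur in the reference labels.
    precisions = [hits[f] / n for f, n in predicted_count.items()]
    recalls = [hits[f] / n for f, n in true_count.items()]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Made-up example: one sample of famA is mislabelled as famB.
precision, recall = macro_precision_recall(
    ["famA", "famA", "famB", "famB"],
    ["famA", "famB", "famB", "famB"],
)
```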

        Published in

        ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies
        September 2020
        250 pages
        ISBN: 9781450387859
        DOI: 10.1145/3422713

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited