EM Meets Malicious Data: A Novel Method for Massive Malware Family Inference

Authors:
Yongkang Jiang, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Shenghong Li, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Tong Li, Electric Power Research Institute of State Grid, Liaoning Electric Power Co., Ltd., Liaoning, China

ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies, September 2020, Pages 74–79
https://doi.org/10.1145/3422713.3422743

Published: 23 October 2020

ABSTRACT

This paper examines the problem of inferring the true malware family of a sample from inconsistent antivirus vendor labels. Our insight is that vendors are not equally reliable, so we construct a two-dimensional probability matrix for each vendor to model its ability to identify each family. We then formalize the inference task as a maximum likelihood estimation problem with hidden random variables and propose a solution based on the expectation-maximization (EM) algorithm.

We first evaluate our model on Malgenome, a popular Android dataset with 1,234 samples. The results indicate that our scheme achieves 93.19% precision with 86.82% recall, outperforming related work, especially on some unknown and ambiguous families. We then build a larger dataset of 10,165 samples, randomly selected from 12 Windows and Linux malware families, to verify the robustness of our method. The experimental results show that our solution also obtains high precision and recall on this dataset.
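
The idea of a per-vendor probability matrix combined with EM can be illustrated with a minimal sketch. It follows the classic observer-error-rate formulation of EM for noisy labelers and is not taken from the paper, whose exact model may differ. Everything named here is an assumption made for illustration: the function `em_family_inference`, integer-coded family labels, `-1` as the marker for a missing vendor label, the majority-vote initialisation, and the `smoothing` constant.

```python
import numpy as np


def em_family_inference(labels, n_families, n_iter=50, smoothing=1e-2):
    """Infer one family per sample from noisy vendor labels (illustrative sketch).

    labels: int array of shape (n_samples, n_vendors); entry [i, v] is the
            family index reported by vendor v for sample i, or -1 if missing.
    Returns (predicted family per sample, per-vendor matrices, posteriors).
    """
    labels = np.asarray(labels)
    n_samples, n_vendors = labels.shape
    observed = labels >= 0

    # Initialise the posterior over true families with a smoothed majority vote.
    post = np.full((n_samples, n_families), smoothing)
    for l in range(n_families):
        post[:, l] += (labels == l).sum(axis=1)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: family prior and one probability matrix per vendor, where
        # conf[v, t, l] estimates P(vendor v reports l | true family is t).
        prior = post.sum(axis=0) + smoothing
        prior /= prior.sum()
        conf = np.full((n_vendors, n_families, n_families), smoothing)
        for v in range(n_vendors):
            for l in range(n_families):
                mask = labels[:, v] == l
                conf[v, :, l] += post[mask].sum(axis=0)
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over the hidden true family of each sample,
        # computed in log space and skipping missing labels.
        log_post = np.tile(np.log(prior), (n_samples, 1))
        for v in range(n_vendors):
            rows = np.where(observed[:, v])[0]
            log_post[rows] += np.log(conf[v][:, labels[rows, v]]).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post.argmax(axis=1), conf, post
```

In this sketch each row of `conf[v]` is a probability distribution over the labels a vendor reports for one true family, so a vendor that always answers with the same family ends up with near-identical rows and contributes almost nothing to the posterior. That is one way to express the intuition that vendors are not equally reliable.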

Index Terms

1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation
2. Theory of computation
  1. Models of computation
    1. Probabilistic computation

Published in

ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies
September 2020
250 pages
ISBN: 9781450387859
DOI: 10.1145/3422713

Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
expectation maximization
family
inference
malware
Qualifiers
- research-article
- Research
- Refereed limited