Cloud-based malware detection for evolving data streams

Authors:

Mohammad M. Masud,

Tahseen M. Al-Khateeb,

Kevin W. Hamlen,

Bhavani Thuraisingham

ACM Transactions on Management Information Systems (TMIS), Volume 2, Issue 3

Article No.: 16, Pages 1 - 27

https://doi.org/10.1145/2019618.2019622

Published: 18 October 2011

Abstract

Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail.

In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
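
The multipartition, multichunk training step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid base learner, the function names (`train_batch`, `ensemble_predict`), the majority-vote rule, and the toy data are all assumptions made for illustration, and the policy for retiring old classifiers as the stream drifts is omitted. The sketch only shows how r consecutive chunks are pooled, split into v partitions, and used to train v classifiers, each on all partitions but one.

```python
import random
from collections import Counter

def train_centroid(examples):
    """Toy base learner (an assumption): per-class mean of the feature vectors."""
    sums, counts = {}, Counter()
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}

def predict_centroid(model, x):
    """Return the label whose centroid is nearest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

def train_batch(chunks, v, seed=0):
    """Train v classifiers from r consecutive chunks using v-fold partitioning.

    The chunks are pooled and shuffled, then split into v partitions;
    classifier i is trained on every partition except partition i.
    """
    data = [example for chunk in chunks for example in chunk]
    random.Random(seed).shuffle(data)
    parts = [data[i::v] for i in range(v)]
    models = []
    for held_out in range(v):
        train = [ex for j, part in enumerate(parts) if j != held_out for ex in part]
        models.append(train_centroid(train))
    return models

def ensemble_predict(models, x):
    """Classify x by majority vote over all classifiers in the ensemble."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy stream: r = 2 chunks of labeled 2-D feature vectors.
chunks = [
    [((0.0, 0.1), "benign"), ((0.2, 0.0), "benign"),
     ((1.0, 0.9), "malicious"), ((0.9, 1.1), "malicious")],
    [((0.1, 0.2), "benign"), ((0.0, 0.0), "benign"),
     ((1.1, 1.0), "malicious"), ((1.0, 1.2), "malicious")],
]
models = train_batch(chunks, v=3)
print(ensemble_predict(models, (0.95, 1.0)))
```

Training each classifier on v - 1 of the v partitions gives every classifier a slightly different view of the same r chunks, which is one plausible reading of how the technique obtains ensemble diversity; a single-partition, single-chunk baseline would correspond to r = 1 with one classifier per chunk.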

Index Terms

Cloud-based malware detection for evolving data streams
1. Information systems
  1. Information systems applications
2. Security and privacy
  1. Systems security
    1. Operating systems security

Published In

ACM Transactions on Management Information Systems Volume 2, Issue 3

October 2011

138 pages

ISSN:2158-656X

EISSN:2158-6578

DOI:10.1145/2019618

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Accepted: 01 August 2011

Revised: 01 July 2011

Received: 01 April 2011

Published: 18 October 2011

Published in TMIS Volume 2, Issue 3

Qualifiers

Research-article
Research
Refereed
