research-article

Cloud-based malware detection for evolving data streams

Published: October 2011

Abstract

Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail.
In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
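The multipartition, multichunk training step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract names no base learner, so a toy nearest-centroid classifier stands in; the rule for keeping the lowest-error models on the most recent chunk, the `capacity` parameter, and the majority vote are illustrative choices; and `make_chunk` is a hypothetical helper that generates synthetic two-class data.

```python
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base learner (an assumption; the abstract names no base classifier)."""

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, value in enumerate(xi):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict(self, x):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=squared_distance)


def train_partition_models(chunks, v, rng):
    """Train v models from the union of r consecutive labeled chunks.

    The merged data is split into v folds; model i is trained on every
    fold except fold i (v-fold partitioning).
    """
    data = [pair for chunk in chunks for pair in chunk]
    rng.shuffle(data)
    folds = [data[i::v] for i in range(v)]
    models = []
    for held_out in range(v):
        train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
        X, y = zip(*train)
        models.append(NearestCentroid().fit(X, y))
    return models


def error_rate(model, chunk):
    """Fraction of (x, y) pairs in the chunk that the model misclassifies."""
    return sum(model.predict(x) != y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble, recent_chunks, v, capacity, rng):
    """Add v new models, then keep the `capacity` models with the lowest
    error on the latest chunk (the selection rule is an assumption)."""
    candidates = ensemble + train_partition_models(recent_chunks, v, rng)
    latest = recent_chunks[-1]
    candidates.sort(key=lambda m: error_rate(m, latest))
    return candidates[:capacity]


def classify(ensemble, x):
    """Majority vote over the ensemble members."""
    votes = Counter(m.predict(x) for m in ensemble)
    return votes.most_common(1)[0][0]


def make_chunk(rng, shift, n):
    """Hypothetical synthetic chunk: two Gaussian classes; `shift` mimics drift."""
    chunk = []
    for _ in range(n):
        label = rng.randrange(2)
        x = (rng.gauss(5.0 * label + shift, 0.5), rng.gauss(5.0 * label, 0.5))
        chunk.append((x, label))
    return chunk
```

In use, each time a new labeled chunk arrives, `update_ensemble` would be called with the last r chunks. Because every model sees data from several chunks and several partitions, the ensemble members are trained on more data than in a single-chunk scheme, which is the intuition behind the error reduction the abstract reports.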


Published In

ACM Transactions on Management Information Systems  Volume 2, Issue 3
October 2011
138 pages
ISSN:2158-656X
EISSN:2158-6578
DOI:10.1145/2019618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Accepted: 01 August 2011
Revised: 01 July 2011
Received: 01 April 2011
Published: October 2011
Published in TMIS Volume 2, Issue 3

Author Tags

  1. Data mining
  2. data streams
  3. malicious executable
  4. malware detection
  5. n-gram analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 25
  • Downloads (last 6 weeks): 1
Reflects downloads up to 02 Mar 2025

Cited By

  • (2024) Cloud-Based Malware Detection: Review. SSRN Electronic Journal. DOI: 10.2139/ssrn.4480742. Online publication date: 2024
  • (2024) Social Media Governance and Fake News Detection Integrated with Artificial Intelligence Governance. 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI), 190-197. DOI: 10.1109/IRI62200.2024.00048. Online publication date: 7-Aug-2024
  • (2023) Advanced Persistent Threat Detection Using Data Provenance and Metric Learning. IEEE Transactions on Dependable and Secure Computing 20:5, 3957-3969. DOI: 10.1109/TDSC.2022.3221789. Online publication date: 1-Sep-2023
  • (2021) Intelligent Data Collaboration in Heterogeneous-device IoT Platforms. ACM Transactions on Sensor Networks 17:3, 1-17. DOI: 10.1145/3427912. Online publication date: 21-Jun-2021
  • (2020) GCI: A GPU Based Transfer Learning Approach for Detecting Cheats of Computer Game. IEEE Transactions on Dependable and Secure Computing, 1-1. DOI: 10.1109/TDSC.2020.3013817. Online publication date: 2020
  • (2020) Hybrid Consensus Pruning of Ensemble Classifiers for Big Data Malware Detection. IEEE Transactions on Cloud Computing 8:2, 398-407. DOI: 10.1109/TCC.2015.2481378. Online publication date: 1-Apr-2020
  • (2020) Evolving Advanced Persistent Threat Detection using Provenance Graph and Metric Learning. 2020 IEEE Conference on Communications and Network Security (CNS), 1-9. DOI: 10.1109/CNS48642.2020.9162264. Online publication date: Jun-2020
  • (2020) Cloud Governance. 2020 IEEE 13th International Conference on Cloud Computing (CLOUD), 86-90. DOI: 10.1109/CLOUD49709.2020.00025. Online publication date: Oct-2020
  • (2020) Improving Classification of Metamorphic Malware by Augmenting Training Data with a Diverse Set of Evolved Mutant Samples. 2020 IEEE Congress on Evolutionary Computation (CEC), 1-7. DOI: 10.1109/CEC48606.2020.9185668. Online publication date: Jul-2020
  • (2020) Customized Convolutional Neural Networks with K-Nearest Neighbor Classification System for Malware Categorization. Journal of Applied Security Research, 1-21. DOI: 10.1080/19361610.2020.1718990. Online publication date: 1-Apr-2020