Are Your Training Datasets Yet Relevant?

An Investigation into the Importance of Timeline in Machine Learning-Based Malware Detection

  • Conference paper
Engineering Secure Software and Systems (ESSoS 2015)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 8978)

Abstract

In this paper, we consider the relevance of timeline in the construction of datasets and highlight its impact on the performance of a machine learning-based malware detection scheme. In particular, we show that simply picking a random set of known malware to train a malware detector, as is done in many assessment scenarios in the literature, yields significantly biased results. In assessing the extent of this impact through various experiments, we were also able to confirm a number of intuitive assumptions about Android malware. For instance, we discuss the existence of Android malware lineages and how they could affect the performance of malware detection in the wild.
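
The contrast between a randomly drawn training set and a timeline-respecting one can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `App` record, the `first_seen` date field, the toy `APPS` list, and the helper names `random_split`, `time_aware_split`, and `leaks_future` are all hypothetical. The sketch only shows how a random split can place samples in the training set that are newer than samples in the test set, whereas a cutoff-based split cannot.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class App:
    # Hypothetical record; the field names are illustrative assumptions.
    name: str
    first_seen: date
    is_malware: bool


def random_split(apps, train_fraction=0.5, seed=0):
    """Shuffle the apps and cut the list, ignoring dates entirely."""
    shuffled = list(apps)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def time_aware_split(apps, cutoff):
    """Train only on apps seen before the cutoff, test on the rest."""
    train = [a for a in apps if a.first_seen < cutoff]
    test = [a for a in apps if a.first_seen >= cutoff]
    return train, test


def leaks_future(train, test):
    """True if some training sample is at least as recent as the oldest test sample."""
    if not train or not test:
        return False
    return max(a.first_seen for a in train) >= min(a.first_seen for a in test)


# Toy data: one app per month of 2012, alternating malware and benign labels.
APPS = [App(f"app{i}", date(2012, 1 + i, 1), i % 2 == 0) for i in range(12)]

if __name__ == "__main__":
    r_train, r_test = random_split(APPS)
    t_train, t_test = time_aware_split(APPS, cutoff=date(2012, 7, 1))
    print("random split leaks future samples:", leaks_future(r_train, r_test))
    print("time-aware split leaks future samples:", leaks_future(t_train, t_test))
```

A detector evaluated on the random split may have seen samples that postdate the ones it is tested on, which a detector deployed in the wild never could. The cutoff-based split rules this out by construction.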

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y. (2015). Are Your Training Datasets Yet Relevant?. In: Piessens, F., Caballero, J., Bielova, N. (eds) Engineering Secure Software and Systems. ESSoS 2015. Lecture Notes in Computer Science, vol 8978. Springer, Cham. https://doi.org/10.1007/978-3-319-15618-7_5

  • DOI: https://doi.org/10.1007/978-3-319-15618-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-15617-0

  • Online ISBN: 978-3-319-15618-7

  • eBook Packages: Computer Science, Computer Science (R0)
