Abstract
In this paper, we consider the relevance of the timeline in the construction of datasets and highlight its impact on the performance of machine learning-based malware detection schemes. In particular, we show that simply picking a random set of known malware to train a malware detector, as is done in many assessment scenarios in the literature, yields significantly biased results. While assessing the extent of this impact through various experiments, we were also able to confirm a number of intuitive assumptions about Android malware. For instance, we discuss the existence of Android malware lineages and how they could impact the performance of malware detection in the wild.
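To make the contrast between random and timeline-aware dataset construction concrete, the following is a minimal sketch and not the authors' experimental setup. It assumes each sample carries a date (for example, a first-seen date); the field names, the toy corpus, and the helper functions `random_split`, `timeline_split`, and `uses_future_knowledge` are all hypothetical names introduced for illustration.

```python
import random
from datetime import date


def random_split(samples, train_fraction=0.5, seed=0):
    """Shuffle and split while ignoring dates (the random-pick scenario)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def timeline_split(samples, cutoff):
    """Train only on samples dated before the cutoff; test on the rest."""
    train = [s for s in samples if s["date"] < cutoff]
    test = [s for s in samples if s["date"] >= cutoff]
    return train, test


def uses_future_knowledge(train, test):
    """True if any training sample is newer than the oldest test sample."""
    if not train or not test:
        return False
    return max(s["date"] for s in train) > min(s["date"] for s in test)


# Hypothetical toy corpus: one app per month of 2012, alternating labels.
samples = [
    {"name": f"app{m:02d}", "date": date(2012, m, 1), "malware": m % 2 == 0}
    for m in range(1, 13)
]

rand_train, rand_test = random_split(samples)
time_train, time_test = timeline_split(samples, cutoff=date(2012, 7, 1))

# The random split will usually mix dates across both sets; the
# timeline split cannot, by construction.
print("random split uses future knowledge:",
      uses_future_knowledge(rand_train, rand_test))
print("timeline split uses future knowledge:",
      uses_future_knowledge(time_train, time_test))
```

Under these assumptions, a detector evaluated on the random split may be trained on samples that postdate the apps it is tested on, which is one plausible way the bias described above could arise. The timeline split rules this out because every training sample predates every test sample.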
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y. (2015). Are Your Training Datasets Yet Relevant?. In: Piessens, F., Caballero, J., Bielova, N. (eds) Engineering Secure Software and Systems. ESSoS 2015. Lecture Notes in Computer Science, vol 8978. Springer, Cham. https://doi.org/10.1007/978-3-319-15618-7_5
DOI: https://doi.org/10.1007/978-3-319-15618-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-15617-0
Online ISBN: 978-3-319-15618-7
eBook Packages: Computer Science, Computer Science (R0)