Are Your Training Datasets Yet Relevant?

Allix, Kevin; Bissyandé, Tegawendé F.; Klein, Jacques; Le Traon, Yves

doi:10.1007/978-3-319-15618-7_5

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 8978)

Included in the following conference series:

International Symposium on Engineering Secure Software and Systems

Abstract

In this paper, we consider the relevance of the timeline in the construction of datasets and highlight its impact on the performance of a machine learning-based malware detection scheme. In particular, we show that simply picking a random set of known malware to train a malware detector, as is done in many assessment scenarios from the literature, yields significantly biased results. In the process of assessing the extent of this impact through various experiments, we were also able to confirm a number of intuitive assumptions about Android malware. For instance, we discuss the existence of Android malware lineages and how they could impact the performance of malware detection in the wild.
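
The contrast drawn above, between picking training malware at random and respecting the timeline, can be illustrated with a minimal Python sketch. The record layout, the function names, and the cutoff date are assumptions made for illustration only; they are not the authors' experimental setup or data.

```python
import random
from datetime import date

# Hypothetical records: (app_id, first_seen_date, is_malware).
# Names, dates, and labels are illustrative, not taken from the paper.
def make_apps(n=12):
    return [(f"app{i}", date(2012, 1 + i % 12, 1), i % 3 == 0) for i in range(n)]

def random_split(apps, train_fraction=0.5, seed=0):
    """Time-agnostic split: training apps may be newer than test apps."""
    shuffled = apps[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def time_aware_split(apps, cutoff):
    """Train on apps first seen before `cutoff`, test on the rest."""
    train = [a for a in apps if a[1] < cutoff]
    test = [a for a in apps if a[1] >= cutoff]
    return train, test

def leaks_future(train, test):
    """True if any training app is newer than the oldest test app."""
    return max(a[1] for a in train) > min(a[1] for a in test)

apps = make_apps()
rand_train, rand_test = random_split(apps)
time_train, time_test = time_aware_split(apps, cutoff=date(2012, 7, 1))
print("random split leaks future data:", leaks_future(rand_train, rand_test))
print("time-aware split leaks future data:", leaks_future(time_train, time_test))
```

In this sketch, the time-aware split never trains on an app that is newer than a test app, whereas a random split generally offers no such guarantee. A detector evaluated on a random split may therefore have seen samples from the "future" of its test set.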

Author information

Authors and Affiliations

SnT, University of Luxembourg
Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein & Yves Le Traon

Editor information

Editors and Affiliations

iMinds-DistriNet, KU Leuven, Belgium
Frank Piessens
IMDEA Software Institute, Campus de Montegancedo S/N, 28223, Pozuelo de Alarcón, Spain
Juan Caballero
Inria Sophia Antipolis – Mediterranee, 2004 route des Lucioles, B.P. 93, 06902, Sophia Antipolis Cedex, France
Nataliia Bielova

Cite this paper

Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y. (2015). Are Your Training Datasets Yet Relevant? In: Piessens, F., Caballero, J., Bielova, N. (eds) Engineering Secure Software and Systems. ESSoS 2015. Lecture Notes in Computer Science, vol 8978. Springer, Cham. https://doi.org/10.1007/978-3-319-15618-7_5

DOI: https://doi.org/10.1007/978-3-319-15618-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-15617-0
Online ISBN: 978-3-319-15618-7
eBook Packages: Computer Science, Computer Science (R0)
