Seven Pitfalls of Using Data Science in Cybersecurity

Johnstone, Mike; Peacock, Matt

doi:10.1007/978-3-030-38788-4_6

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 177)

Abstract

Machine learning, a subset of artificial intelligence, is used for many problems where a data-driven approach is required and the problem space involves either classification or prediction. The hype surrounding machine learning, coupled with the ease of use of machine learning tools, can lead to a (mistaken) belief that machine learning is a panacea for all problems and that simply feeding large volumes of data to an algorithm will generate a sensible and usable answer. In this chapter, we explore several pitfalls that a data scientist must evaluate in order to obtain some tangible meaning from the results provided by a machine learning algorithm. There is some evidence to suggest that algorithm choice is not a discriminator. In particular, we explore the importance of feature set selection and evaluate the inherent problems in relying on synthetic data.

Author information

Authors and Affiliations

Edith Cowan University, Perth, Australia
Mike Johnstone
Sapien Cyber, Perth, Australia
Matt Peacock

Corresponding author

Correspondence to Mike Johnstone.

Editor information

Editors and Affiliations

School of Science, Edith Cowan University, Joondalup, WA, Australia
Leslie F. Sikos
Department of Information Systems and Security, University of Texas at San Antonio, San Antonio, TX, USA
Kim-Kwang Raymond Choo

About this chapter

Cite this chapter

Johnstone, M., Peacock, M. (2020). Seven Pitfalls of Using Data Science in Cybersecurity. In: Sikos, L., Choo, KK. (eds) Data Science in Cybersecurity and Cyberthreat Intelligence. Intelligent Systems Reference Library, vol 177. Springer, Cham. https://doi.org/10.1007/978-3-030-38788-4_6

DOI: https://doi.org/10.1007/978-3-030-38788-4_6
Published: 06 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38787-7
Online ISBN: 978-3-030-38788-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
