Abstract
In order to assist security analysts in obtaining information pertaining to the cybersecurity tailored to the security domain are needed. Since labeled text data is scarce and expensive, Named Entity Recognition (NER) is used to detect the relevant domain entities from the raw text. To train a new NER model for cybersecurity, traditional NER requires a training corpus annotated with cybersecurity entities. Our previous work proposed a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. This method requires small dictionary that has the pairs of keywords and their categories, and text data. However, the semantic similarity measurement in the method to solve the ambiguous keywords requires the specific category names even if non cybersecurity related categories. In this work, we introduce another semantic similarity measurement using text category classifier which does not require to give the specific non cybersecurity related category name. We compare the performance of the two semantic similarity measurements, and the new measurement performs better. The experimental evaluation result shows that our method with the training data that is annotated by small dictionary performs almost same performance of the models that are trained with fully annotated data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, 20–26 August, 2018, pp. 1638–1649 (2018)
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven pretraining of self-attention networks. CoRR, abs/1903.07785 (2019)
Bridges, R.A., Jones, C.L., Iannacone, M.D., Goodall, J.R.: Automatic labeling for entity extraction in cyber security. CoRR, abs/1308.4941 (2013)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 999888, 2493–2537 (2011)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Dong, Y., Guo, W., Chen, Y., Xing, X., Zhang, Y., Wang, G.: Towards the detection of inconsistencies in public security vulnerability reports. In: Heninger, N., Traynor, P., (eds.) 28th USENIX Security Symposium, USENIX Security 2019, Santa Clara, CA, USA, 14–16 August, 2019, pp. 869–885. USENIX Association (2019)
Gasmi, H., Bouras, A., Laval, J.: Lstm recurrent neural networks for cybersecurity named entity recognition. In: ICSEA 2018, p. 11 (2018)
Gasmi, H., Bouras, A., Laval, J.: LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA 11, 2018 (2018)
Graves, A., Mohamed, A., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, 26–31 May, 2013, pp. 6645–6649. IEEE (2013)
Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991 (2015)
Jones, C.L., Bridges, R.A., Huffer, K.M.T., Goodall, J.R.: Towards a relation extraction framework for cyber-security concepts. In Proceedings of the 10th Annual Cyber and Information Security Research Conference, CISR 2015, Oak Ridge, TN, USA, 7–9 April, 2015, pp. 11:1–11:4 (2015)
Kashihara, K., Shakarian, J., Baral, C.: Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1251, pp. 347–361. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55187-2_28
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, pp. 260–270 (2016)
Ma, X., Hovy, E.H.: End-to-end sequence labeling via bi-directional lstm-cnns-crf. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 7–12 August, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics (2016)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119 (2013)
Mulwad, V., Li, W., Joshi, A., Finin, T., Viswanathan, K.: Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011, Campus Scientifique de la Doua, Lyon, France, 22–27 August, 2011, pp. 257–260 (2011)
Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 2: Short Papers, pp. 365–371 (2015)
Vinayakumar R., Alazab, M., Jolfaei, A., Soman, K.P., Poornachandran, P.: Ransomware triage using deep learning: Twitter as a case study. In Cybersecurity and Cyberforensics Conference, CCC 2019, Melbourne, Australia, 8–9 May, 2019, pp. 67–73. IEEE (2019)
Vinayakumar, R., Alazab, M., Srinivasan, S., Pham, Q.-V., Padannayil, S., Ketha, S.: A visualized botnet detection system based deep learning for the internet of things networks of smart cities. IEEE Trans. Ind. Appl. 1–1, January 2020
Tjong Kim Sang, E.F., Buchholz, S.: Introduction to the conll-2000 shared task chunking. In: Fourth Conference on Computational Natural Language Learning, CoNLL 2000, and the Second Learning Language in Logic Workshop, LLL 2000, Held in cooperation with ICGI-2000, Lisbon, Portugal, September 13–14, 2000, pp. 127–132 (2000)
Satyapanich, T., Ferraro, F., Finin, T.: CASIE: extracting cybersecurity event information from text. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pp. 8749–8757. AAAI Press (2020)
Simran, K., Sriram, S., Vinayakumar, R., Soman, K.P.: Deep learning approach for intelligent named entity recognition of cyber security. In: International Symposium on Signal Processing and Intelligent Recognition Systems, pp. 163–172. Springer (2019)
Sirotina, A., Loukachevitch, N.: Named entity recognition in information security domain for Russian. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 1114–1120, Varna, Bulgaria, September 2019. INCOMA Ltd (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp. 5998–6008 (2017)
Vinayakumar, R., Alazab, M., Soman, K.P., Poornachandran, P., Venkatraman, S.: Robust intelligent malware detection using deep learning. IEEE Access 7, 46717–46738 (2019)
Vinayakumar, R., Soman, K.P., Poornachandran, P.: Detecting malicious domain names using deep learning approaches at scale. J. Intell. Fuzzy Syst. 34(3), 1355–1367 (2018)
Vinayakumar, R., Soman, K.P., Poornachandran, P.: Evaluating deep learning approaches to characterize and classify malicious url’s. J. Intell. Fuzzy Syst. 34(3), 1333–1343 (2018)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kashihara, K., Sandhu, H.S., Shakarian, J. (2022). Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-82199-9_11
Download citation
DOI: https://doi.org/10.1007/978-3-030-82199-9_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82198-2
Online ISBN: 978-3-030-82199-9
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)