Abstract
The automated and timely conversion or extraction of cybersecurity information from unstructured text from online sources is important and required for many applications. Named Entity Recognition (NER) is used to detect the relevant domain entities such as product, attack name, malware name, hacker group name, etc. To train a new NER model for cybersecurity, traditional NER requires a training corpus annotated with cybersecurity entities and state-of-the-art methods require time-consuming and labor intensive feature engineering. We propose a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. Our method evaluates the learned NER model with the sentences that we collected in the training process, and the user selects only the correct pair of the named entity and its category for next iteration training. Thus, each iteration gets better training corpora to train the NER model. Some entities are ambiguous since the word or phrase has multiple meanings. We introduce a new semantic similarity measure and determine which category the word belongs to based on this semantic similarity of the entire sentence. The experimental evaluation result shows that our method is better than existing methods in finding undiscovered keywords of given categories.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
References
Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. COLING 2018, Santa Fe, New Mexico, USA, 20–26 August 2018, pp. 1638–1649 (2018)
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven pretraining of self-attention networks. CoRR, abs/1903.07785 (2019)
Bridges, R.A., Jones, C.L., Iannacone, M.D., Goodall, J.R.: Automatic labeling for entity extraction in cyber security. CoRR, abs/1308.4941 (2013)
Carreras, X., Mà rquez, L., Padró, L.: Learning a perceptron-based named entity chunker via online recognition feedback. In: Proceedings of the Seventh Conference on Natural Language Learning. CoNLL 2003, Held in Cooperation with HLT-NAACL 2003, Edmonton, Canada, 31 May–1 June 2003, pp. 156–159 (2003)
Chieu, H.L., Ng, H.T.: Named entity recognition: a maximum entropy approach using global information. In: 19th International Conference on Computational Linguistics. COLING 2002, 24 August–1 September 2002. Howard International House and Academia Sinica, Taipei (2002)
Cimiano, P., Handschuh, S., Staab, S.: Towards the self-annotating web. In: Proceedings of the 13th International Conference on World Wide Web. WWW 2004, New York, NY, USA, 17–20 May 2004, pp. 462–471 (2004)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Volume 1 (Long and Short Papers). NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, pp. 4171–4186 (2019)
Finkel, J.R., Grenager, T., Manning, C.D.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25–30 June 2005, pp. 363–370. University of Michigan, USA (2005)
Gasmi, H., Bouras, A., Laval, J.: LSTM recurrent neural networks for cybersecurity named entity recognition. In: ICSEA 2018, p. 11 (2018)
Gers, F.A., Schmidhuber, J., Cummins, F.A.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif. Intell. Res. 57, 345–420 (2016)
Graves, A.. Mohamed, A.-R., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP 2013, Vancouver, BC, Canada, 26–31 May 2013, pp. 6645–6649 (2013)
Honnibal, M., Montani, I.: spaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing, 7 (2017, to appear)
Isozaki, H., Kazawa, H.: Efficient support vector classifiers for named entity recognition. In: 19th International Conference on Computational Linguistics. COLING 2002, 24 August–1 September 2002. Howard International House and Academia Sinica, Taipei (2002)
Jones, C.L., Bridges, R.A., Huffer, K.M.T., Goodall, J.R.: Towards a relation extraction framework for cyber-security concepts. In: Proceedings of the 10th Annual Cyber and Information Security Research Conference. CISR 2015, Oak Ridge, TN, USA, 7–9 April 2015, pp. 11:1–11:4 (2015)
Joshi, A., Lal, R., Finin, T., Joshi, A.:. Extracting cybersecurity related linked data from text. In: 2013 IEEE Seventh International Conference on Semantic Computing, Irvine, CA, USA, 16–18 September 2013, pp. 252–259 (2013)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, 12–17 June 2016, pp. 260–270 (2016)
McCallum, A., Freitag, D., Pereira, F.C.N.: Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), 29 June–2 July 2 2000, pp. 591–598. Stanford University, Stanford (2000)
McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning. CoNLL 2003, held in Cooperation with HLT-NAACL 2003, Edmonton, Canada, 31 May–1 June 2003, pp. 188–191 (2003)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: word2vec (2014)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting held Lake Tahoe, Nevada, United States, 5–8 December 2013, pp. 3111–3119 (2013)
Mulwad, V., Li, W., Joshi, A., Finin, T., Viswanathan, K.: Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Workshops. WI-IAT 2011. 22–27 August 2011, pp. 257–260. Campus Scientifique de la Doua, Lyon (2011)
Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. Volume 2: Short Papers. ACL 2015, Beijing, China, 26–31 July 2015, pp. 365–371 (2015)
Pantel, P., Pennacchiotti, M.: Automatically harvesting and ontologizing semantic relations. In: Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, pp. 171–195 (2008)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Tjong Kim Sang, E.F., Buchholz, S.: Introduction to the CoNLL-2000 shared task chunking. In: Fourth Conference on Computational Natural Language Learning, CoNLL 2000, and the Second Learning Language in Logic Workshop. LLL 2000, held in Cooperation with ICGI-2000, Lisbon, Portugal, 13–14 September 2000, pp. 127–132 (2000)
Stenetorp, P., Pyysalo, S., Topic, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: a web-based tool for NLP-assisted text annotation. In: EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012, pp. 102–107 (2012)
Acknowledgments
We thank you for Dr. Robert P. Trevino from Maui High Performance Computing center for useful discussion.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kashihara, K., Shakarian, J., Baral, C. (2021). Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1251. Springer, Cham. https://doi.org/10.1007/978-3-030-55187-2_28
Download citation
DOI: https://doi.org/10.1007/978-3-030-55187-2_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55186-5
Online ISBN: 978-3-030-55187-2
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)