Enriching Word Information Representation for Chinese Cybersecurity Named Entity Recognition

Neural Processing Letters

Abstract

Named entity recognition (NER) is a word-level sequence tagging task. The key to Chinese cybersecurity NER is to obtain meaningful word representations and to model inter-word relations carefully. However, Chinese is a language of compound words and lacks morphological inflections. Moreover, the role and meaning of a word depend on its context in complicated ways. In this paper, we present an NER model named Star-HGCN, short for Star-Transformer with Hybrid embeddings and Graph Convolutional Network. To make full use of intra-word information, we place a hybrid embedding layer at the input, which enriches word representations with character-level information and part-of-speech features. We then enhance the hybrid embeddings in two ways: the efficient Star-Transformer architecture models implicit local and long-range semantic associations between words, and a graph convolutional network models the explicit syntactic dependencies between words in the dependency tree. Experiments on the Chinese cybersecurity dataset show that our model outperforms other neural network methods for NER and achieves a significant relative improvement of 36.59% for the software entity class. Experiments on other public datasets also validate the effectiveness of the model in other general and specific domains.
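
To make two of the components named in the abstract more concrete, the following is a minimal, dependency-free sketch of a hybrid embedding (word vector plus character-level and part-of-speech features) followed by one graph-convolution step over a dependency tree. It is an illustration under stated assumptions, not the authors' implementation: the mean pooling over character vectors, the row-normalized undirected adjacency with self-loops, the ReLU activation, and every function name, dimension, and toy value are hypothetical choices made here. The Star-Transformer encoder is omitted entirely.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def hybrid_embedding(word_vec: Vector, char_vecs: Matrix, pos_vec: Vector) -> Vector:
    """Concatenate a word vector, mean-pooled character vectors, and a POS vector.

    Assumption: mean pooling stands in for whatever character encoder the paper uses.
    """
    dim = len(char_vecs[0])
    char_mean = [sum(c[i] for c in char_vecs) / len(char_vecs) for i in range(dim)]
    return word_vec + char_mean + pos_vec


def normalized_adjacency(n: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Build a row-normalized adjacency matrix with self-loops.

    Assumption: dependency arcs (head, dependent) are treated as undirected
    and unlabeled, which is a simplification.
    """
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for head, dep in edges:
        adj[head][dep] = 1.0
        adj[dep][head] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One graph convolution: ReLU(A_hat @ H @ W), written with plain lists."""
    n, d_in, d_out = len(feats), len(weight), len(weight[0])
    # Aggregate each word's features from its syntactic neighbours.
    agg = [[sum(adj[i][k] * feats[k][d] for k in range(n)) for d in range(d_in)]
           for i in range(n)]
    # Project the aggregated features and apply ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Toy sentence of three words: 黑客 / 攻击 / 服务器 ("hacker", "attack", "server").
# All vectors below are made-up numbers for illustration only.
words = [
    hybrid_embedding([0.1, 0.2], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]),  # noun
    hybrid_embedding([0.3, 0.1], [[0.5, 0.5], [0.2, 0.4]], [0.0, 1.0]),  # verb
    hybrid_embedding([0.2, 0.4], [[0.0, 1.0], [1.0, 1.0], [0.5, 0.0]], [1.0, 0.0]),
]
# Hypothetical parse: the verb (index 1) heads both nouns.
edges = [(1, 0), (1, 2)]
adj = normalized_adjacency(len(words), edges)
# Hypothetical 6x2 projection matrix; a trained model would learn this.
weight = [[0.5, -0.5], [0.1, 0.2], [0.3, 0.0], [-0.2, 0.4], [0.0, 0.1], [0.2, -0.1]]
hidden = gcn_layer(adj, words, weight)
print(len(hidden), len(hidden[0]))
```

In this sketch each word's output mixes its own hybrid embedding with those of its head and dependents, which is the sense in which a graph convolutional network can inject explicit syntactic structure. A tagging model would stack further layers and a decoder on top of `hidden`.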

Notes

  1. https://www.digmandarin.com/chinese-a-language-of-compound-words.html.

  2. https://github.com/hltcoe/golden-horse.

  3. https://github.com/worry1613/nlp-ner.

  4. https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414.

Funding

This work was supported by the National Natural Science Foundation of China (Grant numbers 62102279 and 11702289), the Key Core Technology and Generic Technology R&D Project of Shanxi Province (Grant number 2020XXX013), the Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (Grant number 2020L0102), and the National Key R&D Program.

Author information

Correspondence to Wen Zheng or Cai Zhao.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

About this article

Cite this article

Yang, D., Lian, T., Zheng, W. et al. Enriching Word Information Representation for Chinese Cybersecurity Named Entity Recognition. Neural Process Lett 55, 7689–7707 (2023). https://doi.org/10.1007/s11063-023-11280-7

  • DOI: https://doi.org/10.1007/s11063-023-11280-7
