Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation

  • Conference paper
Chinese Lexical Semantics (CLSW 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12278)

Abstract

Word segmentation and named entity annotation are essential foundations for information extraction from medical text. This paper focuses on clinical pediatric diseases and takes existing medical named entity and entity-relationship labeling systems as references. Guided by Chinese word segmentation and named entity labeling conventions, it constructs annotation specifications for pediatric medical texts. A self-developed distributed annotation platform is used to pre-annotate the named entities, which are then manually proofread over multiple rounds. The resulting corpus consists of 38,805 medical entries in nine categories, including 504 entries for common pediatric diseases, 7,085 for body parts, 12,907 for clinical manifestations, and 4,354 for medical procedures. The result is the largest corpus with pediatric medical word segmentation and named entity annotation, providing a data basis for related research.
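
The entry counts in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are made up for this example, and it assumes the per-category counts are non-overlapping parts of the 38,805 total. The abstract itemizes only four of the nine categories, so the other five appear here as a single remainder.

```python
# Entry counts quoted in the abstract. Only four of the nine categories
# are itemized there; the rest are treated as one combined remainder.
TOTAL_ENTRIES = 38_805
named_counts = {
    "common pediatric diseases": 504,
    "body parts": 7_085,
    "clinical manifestations": 12_907,
    "medical procedures": 4_354,
}

# Assumption: categories do not overlap, so counts can simply be summed.
named_total = sum(named_counts.values())
remaining = TOTAL_ENTRIES - named_total  # entries in the five unlisted categories

# Share of the whole corpus taken by each itemized category.
shares = {name: count / TOTAL_ENTRIES for name, count in named_counts.items()}

for name, share in shares.items():
    print(f"{name}: {named_counts[name]} entries ({share:.1%})")
print(f"other five categories combined: {remaining} entries "
      f"({remaining / TOTAL_ENTRIES:.1%})")
```

Under that assumption, the four itemized categories cover 24,850 entries, leaving 13,955 entries across the five categories the abstract does not list.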

Acknowledgement

We gratefully acknowledge the grant support of the National Key Research and Development Project (Grant No. 2017YFB1002101), the Major Program of the National Social Science Foundation of China (Grant No. 17ZDA138), the Science and Technique Program of Henan Province (Grant No. 192102210260), the Medical Science and Technique Program Co-sponsored by Henan Province and Ministry (Grant No. SB201901021), and the Key Scientific Research Program of Higher Education of Henan Province (Grant Nos. 19A520003, 20A520038).

Author information

Corresponding authors

Correspondence to Zan Hongying or Li Wenxin.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Hongying, Z., Wenxin, L., Kunli, Z., Yajuan, Y., Baobao, C., Zhifang, S. (2021). Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation. In: Liu, M., Kit, C., Su, Q. (eds) Chinese Lexical Semantics. CLSW 2020. Lecture Notes in Computer Science, vol 12278. Springer, Cham. https://doi.org/10.1007/978-3-030-81197-6_55

  • DOI: https://doi.org/10.1007/978-3-030-81197-6_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81196-9

  • Online ISBN: 978-3-030-81197-6

  • eBook Packages: Computer Science, Computer Science (R0)
