Abstract
This paper proposes a BERT-BiLSTM-CRF method for recognizing Xinjiang local drug names, built on the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model. BERT is pre-trained with a bidirectional Transformer structure using the masked language model (MaskLM) objective, in which some Chinese characters of the input sequence are randomly selected and replaced with special symbols. The word vector of each character is generated dynamically according to the position information of the Chinese characters in Xinjiang local drug names, and the word vector sequence is then fed into a bidirectional LSTM layer, which is trained to capture the dependencies within the sequence. Finally, the CRF module takes the joint probability of the entire tag sequence as its output and returns the globally optimal labeling. On the Xinjiang local drug corpus, the model achieves a precision of 95.77%, a recall of 89.47%, and an F value of 92.52% for named entity recognition. The experimental results show that BERT-BiLSTM-CRF effectively improves the evaluation metrics for Xinjiang local drug name recognition in practical applications.
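The final CRF step can be illustrated with a minimal Viterbi decoding sketch. It is not the authors' implementation: the BIO tag set, the emission scores (standing in for BiLSTM outputs), and the transition scores are hypothetical values chosen only to show how scoring the whole tag sequence differs from picking the best tag for each character independently.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> Tuple[List[str], float]:
    """Return the highest-scoring tag sequence and its total score."""
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []  # back[t-1][tag]: best previous tag
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical scores for a three-character input (assumed, not from the paper).
tags = ["B", "I", "O"]
emissions = [
    {"B": 0.9, "I": 0.0, "O": 1.0},
    {"B": 0.1, "I": 1.0, "O": 0.3},
    {"B": 0.0, "I": 0.2, "O": 1.0},
]
# An "I" tag may not follow "O"; unlisted transitions score 0.
transitions = {("O", "I"): -10.0}

path, score = viterbi_decode(emissions, transitions, tags)
print(path)  # ['B', 'I', 'O']
```

Choosing the highest emission at each position independently would give O, I, O, which contains the penalized O to I transition. Scoring the sequence jointly selects B, I, O instead.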
ACKNOWLEDGMENTS
I would like to thank Yang Qimeng, Hu Wei, Kang Keming, Wang Xiaozhuo, Jiang Yuan, and other students for their help and support with this article, and to extend my sincere gratitude and highest respect to them.
Funding
This work was supported by the National Natural Science Foundation of China (nos. 61563051, 61662074, 61262064), the Key Project of the National Natural Science Foundation of China (no. 61331011), the Xinjiang Uygur Autonomous Region Scientific and Technological Personnel Training Project (no. QN2016YX0051), and the Tianshan Excellent Youth Fund of Xinjiang Autonomous Region (Q011).
Ethics declarations
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Song, Y., Tian, S. & Yu, L. A Method for Identifying Local Drug Names in Xinjiang Based on BERT-BiLSTM-CRF. Aut. Control Comp. Sci. 54, 179–190 (2020). https://doi.org/10.3103/S0146411620030098