Abstract
General named entity recognition (NER) systems focus exclusively on achieving higher accuracy and assume clean input. However, raw source data pose serious challenges, especially when they originate from automatic speech recognition (ASR) output. In this paper, we propose a Pinyin Hierarchical Attention Encoder-Decoder network and a Character Alternate Network (pinyin is the official romanization system for Standard Chinese; each Chinese character has its own pinyin sequence composed of Latin letters) to overcome the Chinese homophone problems that frequently frustrate downstream Natural Language Understanding (NLU). Our models use a segmentation-free structure that avoids secondary data corruption and adequately extracts words' internal features. In addition, corrupted sequences can be revised by the character-level network. Evaluation demonstrates that our proposed method achieves a 93.73% F1 score on a homophone-noisy dataset, compared with 90.97% for baseline models. Additional experiments show equivalent results on a universal dataset.
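As background for the homophone problem the abstract targets, the sketch below shows how distinct Chinese characters collapse to the same pinyin, which is why ASR output can swap them. The mini-lexicon here is a tiny hand-built mapping for illustration only, not the paper's data or model.

```python
# Minimal illustration of the Chinese homophone problem.
# The pinyin mapping below is a hypothetical hand-built lexicon.
PINYIN = {
    "他": "ta1",   # "he"
    "她": "ta1",   # "she" -- same pronunciation, different character
    "市": "shi4",  # "city"
    "是": "shi4",  # "is"  -- homophone of 市
}

def homophones(char: str) -> list[str]:
    """Return all characters in the lexicon sharing `char`'s pinyin."""
    p = PINYIN[char]
    return [c for c, q in PINYIN.items() if q == p]

print(homophones("市"))  # ['市', '是'] -- an ASR system hearing "shi4" may emit either
```

Because several characters share one pinyin string, a recognizer working from audio alone cannot distinguish them, which is the corruption the paper's character-level network is designed to revise.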
Notes
- 1. Character accuracy rate = number of correctly rectified characters / number of wrong characters.
- 2.
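Note 1's evaluation metric can be computed directly; the helper below is an illustrative implementation, not code from the paper.

```python
def character_accuracy_rate(num_correctly_rectified: int, num_wrong: int) -> float:
    """Character accuracy rate = correctly rectified characters / wrong characters (note 1)."""
    if num_wrong == 0:
        raise ValueError("no wrong characters to rectify")
    return num_correctly_rectified / num_wrong

# e.g. 45 of 50 corrupted characters correctly restored
print(character_accuracy_rate(45, 50))  # 0.9
```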
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Z., Wu, G. (2019). Named Entity Recognition with Homophones-Noisy Data. In: Nayak, A., Sharma, A. (eds) PRICAI 2019: Trends in Artificial Intelligence. PRICAI 2019. Lecture Notes in Computer Science(), vol 11670. Springer, Cham. https://doi.org/10.1007/978-3-030-29908-8_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29907-1
Online ISBN: 978-3-030-29908-8
eBook Packages: Computer Science; Computer Science (R0)