Named Entity Recognition with Homophones-Noisy Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11670)

Abstract

General named entity recognition systems focus exclusively on achieving higher accuracy and pay little attention to dirty data. However, raw source data pose serious challenges, especially when they originate from the output of automatic speech recognition systems. In this paper, we propose a Pinyin Hierarchical Attention Encoder-Decoder network and a Character Alternate Network to overcome the problems caused by Chinese homophones, which frequently frustrate subsequent Natural Language Understanding (NLU). (Pinyin is the official romanization system for Standard Chinese; each Chinese character has its own pinyin sequence composed of Latin letters.) Our models use a structure that requires no word segmentation, which effectively avoids secondary data corruption and adequately extracts words' internal features. In addition, corrupted sequences can be revised by a character-level network. Evaluation demonstrates that our proposed method achieves an F1 score of 93.73% on the homophone-noisy dataset, compared with 90.97% for the baseline models. Additional experiments show equivalent results on a universal dataset.
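
The homophone problem targeted here can be seen directly from the pinyin of an ASR-corrupted sentence. The following minimal sketch is illustrative only: it assumes the third-party pypinyin package and an invented sentence pair, neither of which comes from the paper, and it shows why a pinyin-level encoding survives homophone substitutions that corrupt the character sequence.

```python
# Illustration of homophone noise, assuming the third-party pypinyin package
# (pip install pypinyin); the sentences are invented examples, not paper data.
from pypinyin import lazy_pinyin

reference = "我在北京工作"   # intended transcript: "I work in Beijing" (LOC entity 北京)
asr_output = "我在背景工作"  # hypothetical ASR output: 北京 replaced by homophone 背景

# Character level: the entity is corrupted and no longer matches the reference.
print([(r, h) for r, h in zip(reference, asr_output) if r != h])
# -> [('北', '背'), ('京', '景')]

# Pinyin level (tones ignored): both sentences are identical, so a model that
# also encodes pinyin sequences can still recover the entity despite the error.
print(lazy_pinyin(asr_output))                            # ['wo', 'zai', 'bei', 'jing', 'gong', 'zuo']
print(lazy_pinyin(reference) == lazy_pinyin(asr_output))  # True
```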


Notes

  1. Character accuracy rate = number of correctly rectified characters / number of wrongly transcribed characters (a worked example follows these notes).

  2. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.
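
For clarity, the metric in Note 1 can be written out as below; the notation is introduced here for illustration and is not the paper's own.

```latex
\text{character accuracy rate} = \frac{n_{\text{correctly rectified}}}{n_{\text{wrong}}}
```

For example, if an ASR transcript contains 200 wrongly transcribed characters and the character-level network correctly rectifies 170 of them, the character accuracy rate is 170/200 = 85%. (The numbers are invented for illustration.)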


Author information

Corresponding author

Correspondence to Gang Wu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, Z., Wu, G. (2019). Named Entity Recognition with Homophones-Noisy Data. In: Nayak, A., Sharma, A. (eds) PRICAI 2019: Trends in Artificial Intelligence. PRICAI 2019. Lecture Notes in Computer Science (LNAI), vol 11670. Springer, Cham. https://doi.org/10.1007/978-3-030-29908-8_26


  • DOI: https://doi.org/10.1007/978-3-030-29908-8_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29907-1

  • Online ISBN: 978-3-030-29908-8

  • eBook Packages: Computer Science; Computer Science (R0)
