Portuguese Named Entity Recognition Using LSTM-CRF

  • Conference paper
Computational Processing of the Portuguese Language (PROPOR 2018)

Abstract

Named Entity Recognition (NER) is a challenging Natural Language Processing task for a language as rich as Portuguese. For this task, a Deep Learning architecture based on bidirectional Long Short-Term Memory with Conditional Random Fields (LSTM-CRF) has shown state-of-the-art performance for English, Spanish, Dutch, and German. In this work, we evaluate this architecture on Portuguese corpora and tune its hyperparameters. With the best hyperparameter values found, the model achieves state-of-the-art performance, improving on previous results for Portuguese by up to 5 points in F1 score.

Thanks to Data-H Data Science and Artificial Intelligence (www.datah.com.br) and Aviso Urgente (https://avisourgente.com.br) for the financial support, and to Cicero Nogueira dos Santos for kindly sharing insights regarding the HAREM corpora.

Notes

  1. https://www.wikipedia.org/.

  2. This is because, in the IOB scheme, I marks a token inside a named entity and O marks a non-entity token. A token that follows O must therefore be either the first token of an entity or another non-entity token. Since the first token of a named entity is always tagged B, an internal entity token (I) cannot directly follow a non-entity token (O).
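
The constraint in note 2 can be sketched as a small validity check over a tag sequence. This is a minimal illustration, not the authors' implementation. The function names and the labels `PER` and `LOC` are hypothetical. The check also assumes the common convention that an I tag must continue an entity of the same type, which goes beyond what the note states.

```python
from typing import List


def is_valid_iob_transition(prev_tag: str, tag: str) -> bool:
    """Return True if `tag` may directly follow `prev_tag`.

    Assumes tags of the form "O", "B-TYPE" or "I-TYPE" (hypothetical labels).
    """
    if not tag.startswith("I-"):
        # O and B-* may follow any tag.
        return True
    if prev_tag == "O":
        # The rule from the note: I cannot follow O.
        return False
    # Extra assumption (not in the note): I-X must continue an entity of type X.
    return prev_tag[2:] == tag[2:]


def find_invalid_transitions(tags: List[str]) -> List[int]:
    """Return the positions whose tag cannot follow the previous tag.

    The start of the sentence is treated as if it were preceded by "O".
    """
    invalid = []
    prev = "O"
    for i, tag in enumerate(tags):
        if not is_valid_iob_transition(prev, tag):
            invalid.append(i)
        prev = tag
    return invalid
```

For example, `find_invalid_transitions(["O", "I-PER", "B-LOC", "I-LOC", "O"])` flags position 1, where an I tag directly follows O.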

Author information

Corresponding author

Correspondence to Pedro Vitor Quinta de Castro.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Quinta de Castro, P.V., Félix Felipe da Silva, N., da Silva Soares, A. (2018). Portuguese Named Entity Recognition Using LSTM-CRF. In: Villavicencio, A., et al. Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol 11122. Springer, Cham. https://doi.org/10.1007/978-3-319-99722-3_9

  • DOI: https://doi.org/10.1007/978-3-319-99722-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99721-6

  • Online ISBN: 978-3-319-99722-3

  • eBook Packages: Computer Science (R0)
