Named Entity Recognition for Brazilian Portuguese Product Titles

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Abstract

Improving the interaction between consumers and marketplaces in order to reach higher conversion rates is one of the main goals of e-commerce companies. Returning better results for user queries is essential to improving the user experience and converting it into purchases. This paper investigates how named entity recognition can extract relevant attributes from product titles to derive better filters for user queries. We conducted several experiments with MITIE and BERT on smartphone and cell phone product titles from the largest Brazilian retail e-commerce company. Both strategies achieve strong results, with a general F1 score of around 95%. We conclude that a classical machine learning pipeline is still more useful than a large pre-trained language model once throughput and efficiency are taken into account. Future work may evaluate the scalability and reusability of both approaches.

Notes

  1. https://company.ebit.com.br/webshoppers/webshoppersfree.

  2. In English: Samsung Galaxy A01 Smartphone, 32 GB, 2 GB RAM, 5.7" Infinite Screen, 13MP Dual Rear Camera (Main) + 2MP (Depth), 5MP Front, 3000 mAh Battery, Dual Chip, Android - Blue.

  3. In English: Moto G9 Play 64 GB Smartphone Dual Chip Android 10 Screen 6.5" Qualcomm Snapdragon 4G Camera 48MP+2MP+2MP - Turquoise Green.

  4. In English: iPhone SE 128 GB Black iOS 4G Wi-Fi Screen 4.7" 12MP + 7MP Camera - Apple.

  5. WIT stands for “What Is This” and is used to define the product being sold.

  6. The Smartphone category encompasses cell phones, smartphones and iPhones sold by the largest Brazilian e-commerce marketplace, B2W Digital.

  7. https://github.com/mit-nlp/MITIE.

  8. In this paper, the tags associated with the attributes are presented between brackets and the values associated with them are in bold.

  9. Stop words list in Portuguese: de, a, o, que, e, do, da, em, os, no, na, por, as, dos, ao, das, á, ou, ás, com.

  10. Outliers are product titles that do not belong to the Smartphone category.

  11. Examples of removed outliers, in Portuguese: “célula vegetal ampliada aproximadamente 20 mil vezes”, “celula carga 10 kg sensor peso arduino” and “cédula foleada ouro 100 euros coleção notas moedas euro”.

  12. We did not remove the token “iphone” and its spelling variations, since iPhone titles are formatted very differently from the other items in the data set and the experiments that removed “iphone” produced worse results than those that kept it.

  13. Since how best to measure inter-annotator agreement for named entity annotation is debatable [5], we followed [1] and used the percentage of agreement as our main metric.

  14. https://huggingface.co/bert-base-multilingual-cased/tree/main.

  15. https://github.com/neuralmind-ai/portuguese-bert.

  16. https://www.linguateca.pt/HAREM/.

  17. In previous experiments, the feature-based models and the models trained with multilingual BERT did not perform as well as the fine-tuned models trained with BERTimbau-Base, so these options are not described here.

  18. These two configurations achieved the best results in previous experiments with product titles from the Fashion category.

  19. https://github.com/mit-nlp/MITIE.
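
Two of the notes above describe small mechanical steps that are easy to illustrate: filtering the Portuguese stop words listed in note 9, and the percentage of agreement mentioned in note 13. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The stop word list is copied from note 9; lowercasing, whitespace tokenization, computing agreement per token, and the function names `remove_stop_words` and `percent_agreement` are all assumptions made here, as are the example title and the tag names `BRAND`, `MODEL` and `O`.

```python
# Stop words copied verbatim from note 9.
STOP_WORDS = frozenset(
    "de a o que e do da em os no na por as dos ao das á ou ás com".split()
)


def remove_stop_words(title: str) -> list[str]:
    """Lowercase a title, split on whitespace, and drop the note-9 stop words.

    Lowercasing and whitespace splitting are assumptions of this sketch.
    """
    return [tok for tok in title.lower().split() if tok not in STOP_WORDS]


def percent_agreement(tags_a: list[str], tags_b: list[str]) -> float:
    """Percentage of tokens that two annotators labeled with the same tag.

    Token-level comparison is an assumption; the source only says that
    percentages of agreement were used.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must label the same tokens")
    if not tags_a:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return 100.0 * matches / len(tags_a)


if __name__ == "__main__":
    # Hypothetical title and annotations, for illustration only.
    print(remove_stop_words("Smartphone Samsung Galaxy A01 com Tela de 5.7"))
    annotator_1 = ["O", "BRAND", "MODEL", "MODEL"]
    annotator_2 = ["O", "BRAND", "MODEL", "O"]
    print(percent_agreement(annotator_1, annotator_2))
```

Matching whole tokens rather than substrings keeps a model name such as “a01” intact even though “a” is a stop word. Raw percentage agreement does not correct for chance agreement, which is one reason note 13 calls the choice of metric debatable.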

References

  1. HAREM linguateca datasets. https://www.linguateca.pt/HAREM/. Accessed 08 Jan 2021

  2. Repositório oficial - MITIE. https://github.com/mit-nlp/MITIE

  3. Prodigy: An annotation tool for AI, machine learning, and NLP (2017). https://prodi.gy. Accessed 08 Jan 2021

  4. Cheng, X., Bowden, M., Bhange, B.R., Goyal, P., Packer, T., Javed, F.: An end-to-end solution for named entity recognition in ecommerce search. arXiv preprint arXiv:2012.07553 (2020)

  5. Deleger, L., et al.: Building gold standard corpora for medical natural language processing tasks. In: AMIA Annual Symposium Proceedings (2012)

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)

  7. Geyer, K., Greenfield, K., Mensch, A.C., Simek, O.: Named entity recognition in 140 characters or less (2016)

  8. Joachims, T., Finley, T., Yu, C.J.: Cutting-plane training of structural SVMs. Mach. Learn. 77, 27–59 (2009)

  9. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)

  10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119 (2013)

  11. More, A.: Attribute extraction from product titles in ecommerce. In: Workshop on Enterprise Intelligence - ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2016)

  12. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  13. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing, vol. 14, pp. 1532–1543 (2014)

  14. Real, L., Johansson, K., Mendes, J., Lopes, B., Oshiro, M.: Generating e-commerce product titles in Portuguese. In: Anais do XLVIII Seminário Integrado de Software e Hardware, pp. 299–304. SBC, Porto Alegre, RS, Brasil (2021). https://doi.org/10.5753/semish.2021.15835. https://sol.sbc.org.br/index.php/semish/article/view/15835

  15. Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649 (2019). http://arxiv.org/abs/1909.10649

  16. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20–23 (2020)

  17. Wagner, J., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: International Conference on Language Resources and Evaluation, pp. 4339–4344 (2018)

  18. Xu, H., Wang, W., Mao, X., Jiang, X., Lan, M.: Scaling up open tagging from tens to thousands: comprehension empowered attribute value extraction from product title. In: Annual Meeting of the Association for Computational Linguistics, pp. 5214–5223 (2019). https://doi.org/10.18653/v1/P19-1514

Acknowledgment

This paper and the research behind it would not have been possible without the support of americanas s.a. Digital Lab, especially José Pizani, Ester Campos and Thiago Gouveia Nunes, who closely followed this research. This work is part of the project “Dos dados ao conhecimento: extração e representação de informação no domínio do e-commerce” (Projeto de extensão - UFSCar #23112.000186/2020-97).

Author information

Corresponding author

Correspondence to Diego F. Silva.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Silva, D.F. et al. (2021). Named Entity Recognition for Brazilian Portuguese Product Titles. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_36

  • DOI: https://doi.org/10.1007/978-3-030-91699-2_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91698-5

  • Online ISBN: 978-3-030-91699-2

  • eBook Packages: Computer Science, Computer Science (R0)
