Named Entity Recognition for Brazilian Portuguese Product Titles

Silva, Diego F.; Silva, Alcides M. e; Lopes, Bianca M.; Johansson, Karina M.; Assi, Fernanda M.; de Jesus, Júlia T. C.; Mazo, Reynold N.; Lucrédio, Daniel; Caseli, Helena M.; Real, Livy

doi:10.1007/978-3-030-91699-2_36

Conference paper
First Online: 28 November 2021

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Abstract

Improving the interaction between consumers and marketplaces, with a focus on reaching higher conversion rates, is one of the main goals of e-commerce companies. Offering better results for user queries is essential to improve the user experience and convert it into purchases. This paper investigates how named entity recognition can extract relevant attributes from product titles to derive better filters for user queries. We conducted several experiments based on MITIE and BERT applied to smartphone/cellphone product titles from the largest Brazilian e-commerce retailer. Both strategies achieve strong results, with an overall F1 score of around 95%. We conclude that a classical machine learning pipeline is still more useful than relying on large pre-trained language models when the model's throughput and efficiency are taken into account. Future work may focus on evaluating the scalability and reusability of both approaches.
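To make the idea of turning recognized entities into query filters concrete, the following minimal sketch groups token-level tags into attribute/value pairs. It assumes a BIO tagging scheme, and the tag names (BRAND, MODEL, STORAGE, COLOR) and the function `spans_from_bio` are hypothetical illustrations, not the tag set or code used in the paper. The example tokens are taken from the English rendering of the title in note 2.

```python
from typing import List, Tuple


def spans_from_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, value) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # A new entity begins: flush the one being built, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            words.append(token)
        else:
            # Outside any entity ("O"): close the current entity, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical tags for part of the title in note 2.
tokens = ["Samsung", "Galaxy", "A01", "Smartphone", "32", "GB", "Blue"]
tags = ["B-BRAND", "B-MODEL", "I-MODEL", "O", "B-STORAGE", "I-STORAGE", "B-COLOR"]
attributes = spans_from_bio(tokens, tags)
```

Each resulting pair, such as a storage value of "32 GB", could then back a filter facet for user queries.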

Notes

1.
https://company.ebit.com.br/webshoppers/webshoppersfree.
2.
In English: Samsung Galaxy A01 Smartphone, 32 GB, 2 GB RAM, 5.7″ Infinite Screen, 13MP Dual Rear Camera (Main) + 2MP (Depth), 5MP Front, 3000 mAh Battery, Dual Chip, Android - Blue.
3.
In English: Moto G9 Play 64 GB Smartphone Dual Chip Android 10 Screen 6.5″ Qualcomm Snapdragon 4G Camera 48MP+2MP+2MP - Turquoise Green.
4.
In English: iPhone SE 128 GB Black iOS 4G Wi-Fi Screen 4.7″ 12MP + 7MP Camera - Apple.
5.
WIT stands for “What Is This” and is used to define the product being sold.
6.
The Smartphone category encompasses cell phones, smartphones and iPhones sold by the largest Brazilian e-commerce marketplace, B2W Digital.
7.
https://github.com/mit-nlp/MITIE.
8.
In this paper, the tags associated with the attributes are presented between brackets, and the values associated with them in bold.
9.
Stop words list in Portuguese: de, a, o, que, e, do, da, em, os, no, na, por, as, dos, ao, das, á, ou, ás, com.
10.
Outliers are product titles that do not belong to the Smartphone category.
11.
Examples of removed outliers, in Portuguese: “célula vegetal ampliada aproximadamente 20 mil vezes”, “celula carga 10 kg sensor peso arduino” and “cédula foleada ouro 100 euros coleção notas moedas euro”.
12.
We did not remove the token “iphone” and its spelling variations, since the format of iPhone titles is very different from that of the other items in the data set, and the experiments that removed “iphone” led to worse results than those that kept it.
13.
Since how to measure inter-annotator agreement for named entity annotation is debatable [5], we followed [1] and used the percentage of agreement as our main metric.
14.
https://huggingface.co/bert-base-multilingual-cased/tree/main.
15.
https://github.com/neuralmind-ai/portuguese-bert.
16.
https://www.linguateca.pt/HAREM/.
17.
From previous experiments, we concluded that the feature-based models, as well as the ones trained with the multilingual BERT, did not reach results as good as those of the fine-tuned models trained with BERTimbau-Base, so these options are not described here.
18.
These two configurations were the ones that achieved the best results in previous experiments with product titles from the Fashion category.
19.
https://github.com/mit-nlp/MITIE.
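Notes 9 and 13 mention two simple operations, stop-word removal and percentage agreement between annotators, which can be sketched as follows. This is a minimal illustration under stated assumptions: lowercasing, whitespace tokenization, and token-level comparison of annotations are assumptions made here, not details confirmed by the notes, and the function names and sample inputs are hypothetical. Only the stop-word list itself is copied from note 9.

```python
from typing import List, Sequence

# Stop-word list copied verbatim from note 9.
STOP_WORDS = frozenset(
    "de a o que e do da em os no na por as dos ao das á ou ás com".split()
)


def remove_stop_words(title: str) -> List[str]:
    """Lowercase a title, split on whitespace, and drop stop words.

    Lowercasing and whitespace splitting are assumptions of this sketch.
    """
    return [tok for tok in title.lower().split() if tok not in STOP_WORDS]


def percent_agreement(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Percentage of positions where two annotators assigned the same tag.

    Comparing token by token is an assumption of this sketch.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("annotations must not be empty")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return 100.0 * matches / len(tags_a)


# Hypothetical title and annotations, for illustration only.
cleaned = remove_stop_words("Smartphone com Tela de 6.5 e Camera de 48MP")
agreement = percent_agreement(
    ["BRAND", "MODEL", "O", "COLOR"],
    ["BRAND", "MODEL", "O", "O"],
)
```

Here `cleaned` keeps only the content-bearing tokens, and `agreement` is 75.0 because the annotators differ on one of four tokens.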

References

HAREM linguateca datasets. https://www.linguateca.pt/HAREM/. Accessed 08 Jan 2021
Official repository - MITIE. https://github.com/mit-nlp/MITIE
Prodigy: An annotation tool for AI, machine learning, and NLP (2017). https://prodi.gy. Accessed 08 Jan 2021
Cheng, X., Bowden, M., Bhange, B.R., Goyal, P., Packer, T., Javed, F.: An end-to-end solution for named entity recognition in ecommerce search. arXiv preprint arXiv:2012.07553 (2020)
Deleger, L., et al.: Building gold standard corpora for medical natural language processing tasks. In: AMIA Annual Symposium Proceedings (2012)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
Geyer, K., Greenfield, K., Mensch, A.C., Simek, O.: Named entity recognition in 140 characters or less (2016)
Joachims, T., Finley, T., Yu, C.J.: Cutting-plane training of structural SVMs. Mach. Learn. 77, 27–59 (2009)
King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119 (2013)
More, A.: Attribute extraction from product titles in ecommerce. In: Workshop on Enterprise Intelligence - ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2016)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing, vol. 14, pp. 1532–1543 (2014)
Real, L., Johansson, K., Mendes, J., Lopes, B., Oshiro, M.: Generating e-commerce product titles in Portuguese. In: Anais do XLVIII Seminário Integrado de Software e Hardware, pp. 299–304. SBC, Porto Alegre, RS, Brasil (2021). https://doi.org/10.5753/semish.2021.15835. https://sol.sbc.org.br/index.php/semish/article/view/15835
Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649 (2019). http://arxiv.org/abs/1909.10649
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20–23 (2020)
Wagner, J., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: International Conference on Language Resources and Evaluation, pp. 4339–4344 (2018)
Xu, H., Wang, W., Mao, X., Jiang, X., Lan, M.: Scaling up open tagging from tens to thousands: comprehension empowered attribute value extraction from product title. In: Annual Meeting of the Association for Computational Linguistics, pp. 5214–5223 (2019). https://doi.org/10.18653/v1/P19-1514

Acknowledgment

This paper and the research behind it would not have been possible without the support of americanas s.a. Digital Lab, especially José Pizani, Ester Campos and Thiago Gouveia Nunes, who closely followed this research. This work is part of the project “Dos dados ao conhecimento: extração e representação de informação no domínio do e-commerce” (Projeto de extensão - UFSCar #23112.000186/2020-97).

Author information

Authors and Affiliations

Federal University of São Carlos, São Carlos, SP, Brazil
Diego F. Silva, Alcides M. e Silva, Bianca M. Lopes, Karina M. Johansson, Fernanda M. Assi, Júlia T. C. de Jesus, Reynold N. Mazo, Daniel Lucrédio & Helena M. Caseli
americanas s.a. Digital Lab, São Paulo, SP, Brazil
Livy Real

Corresponding author

Correspondence to Diego F. Silva.

Editor information

Editors and Affiliations

Universidade Federal de Sergipe, São Cristóvão, Brazil
André Britto
Universidade de São Paulo, São Paulo, Brazil
Karina Valdivia Delgado

About this paper

Cite this paper

Silva, D.F. et al. (2021). Named Entity Recognition for Brazilian Portuguese Product Titles. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_36

DOI: https://doi.org/10.1007/978-3-030-91699-2_36
Published: 28 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science, Computer Science (R0)
