Abstract
Improving the interaction between consumers and marketplaces in order to reach higher conversion rates is one of the main goals of e-commerce companies. Offering better results for user queries is essential to improving the user experience and converting it into purchases. This paper investigates how named entity recognition can extract relevant attributes from product titles to derive better filters for user queries. We conducted several experiments based on MITIE and BERT applied to smartphone and cellphone product titles from the largest Brazilian e-commerce retailer. Both strategies achieve outstanding results, with an overall F1 score of around 95%. We conclude that a classical machine learning pipeline is still more useful than relying on large pre-trained language models when model throughput and efficiency are taken into account. Future work may focus on evaluating the scalability and reusability of both approaches.
Notes
- 1.
- 2.
In English: Samsung Galaxy A01 Smartphone, 32 GB, 2 GB RAM, 5.7″ Infinite Screen, 13MP Dual Rear Camera (Main) + 2MP (Depth), 5MP Front, 3000 mAh Battery, Dual Chip, Android - Blue.
- 3.
In English: Moto G9 Play 64 GB Smartphone Dual Chip Android 10 Screen 6.5″ Qualcomm Snapdragon 4G Camera 48MP+2MP+2MP - Turquoise Green.
- 4.
In English: iPhone SE 128 GB Black iOS 4G Wi-Fi Screen 4.7″ 12MP + 7MP Camera - Apple.
- 5.
WIT stands for “What Is This” and is used to define the product being sold.
- 6.
The Smartphone category encompasses cell phones, smartphones and iPhones sold by the largest Brazilian e-commerce marketplace, B2W Digital.
- 7.
- 8.
In this paper, the tags associated with the attributes are presented between brackets, and the values associated with them are shown in bold.
- 9.
Stop words list in Portuguese: de, a, o, que, e, do, da, em, os, no, na, por, as, dos, ao, das, á, ou, ás, com.
- 10.
Outliers are product titles that do not belong to the Smartphone category.
- 11.
Examples of removed outliers, in Portuguese: “célula vegetal ampliada aproximadamente 20 mil vezes”, “celula carga 10 kg sensor peso arduino” and “cédula foleada ouro 100 euros coleção notas moedas euro”.
- 12.
We did not remove the token “iphone” and its spelling variations, since the format of iPhone titles is very different from that of the other items in the data set, and the experiments that removed “iphone” led to worse results than those that kept it.
- 13.
- 14.
- 15.
- 16.
- 17.
From previous experiments, we concluded that the feature-based models as well as the ones trained with the multilingual BERT did not reach results as good as the fine-tuned models trained with BERTimbau-Base, so these options are not described here.
- 18.
These two configurations were the ones that achieved the best results in previous experiments with product titles from the Fashion category.
- 19.
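The stop-word list in Note 9 can be applied to a product title roughly as follows. This is only a minimal sketch, not the authors' preprocessing code: the function name `remove_stop_words`, the lowercasing step, the whitespace tokenization, and the example title are all assumptions made for illustration.

```python
# Stop words exactly as listed in Note 9.
STOP_WORDS = {
    "de", "a", "o", "que", "e", "do", "da", "em", "os", "no",
    "na", "por", "as", "dos", "ao", "das", "á", "ou", "ás", "com",
}


def remove_stop_words(title: str) -> str:
    """Drop stop-word tokens from a product title.

    Assumption: the title is lowercased and split on whitespace;
    the paper's actual tokenization may differ.
    """
    tokens = title.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


# Hypothetical title, used only to illustrate the filter.
print(remove_stop_words("Smartphone Samsung com Tela de 5.7"))
```

A set is used for the stop words so that each membership check takes constant time, which matters when filtering a large catalogue of titles.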
Acknowledgment
This paper and the research behind it would not have been possible without the support of americanas s.a. Digital Lab, especially José Pizani, Ester Campos and Thiago Gouveia Nunes, who closely followed this research. This work is part of the project “Dos dados ao conhecimento: extração e representação de informação no domínio do e-commerce” (Projeto de extensão - UFSCar #23112.000186/2020-97).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, D.F. et al. (2021). Named Entity Recognition for Brazilian Portuguese Product Titles. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_36
DOI: https://doi.org/10.1007/978-3-030-91699-2_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science, Computer Science (R0)