A distantly supervised approach for recognizing product mentions in user-generated content

Vieira, Henry S.; Silva, Altigran S. da; Calado, Pável; de Moura, Edleno S.

doi:10.1007/s10844-022-00718-4

Published: 27 May 2022

Volume 59, pages 543–566 (2022)

Journal of Intelligent Information Systems

Henry S. Vieira¹ (ORCID: orcid.org/0000-0001-8212-1057),
Altigran S. da Silva²,
Pável Calado³ &
Edleno S. de Moura²

Abstract

As online purchasing becomes more popular, users place more trust in information published on social media than in advertising content. Opinion mining is often applied to social media, and opinion target extraction is one of its main sub-tasks. In this paper, we focus on recognizing target entities related to electronic products. We propose a method called ProdSpot for training a named entity extractor to identify product mentions in user text, based on the distant supervision paradigm. ProdSpot relies only on an unlabeled set of product offer titles and a list of product brand names. Initially, surface forms are identified from product titles. Given a collection of user posts, our method selects the sentences that contain at least one surface form and labels them automatically. A cluster-based filtering strategy is then applied to detect and filter out possibly mislabeled sentences. Finally, data augmentation is used to produce more general and diverse training sentences. The set of augmented sentences constitutes the training set for the recognition model. Experiments demonstrate that the automatically generated training data yields results similar to those achieved by a supervised model. Our best result for precision is only 9% lower than that of a supervised model, while our recall is approximately 7% higher in two distinct product categories. Compared to a state-of-the-art supervised method specifically designed to recognize mobile phone names, our method achieved competitive results, with F1 values only 4% lower, while not requiring user supervision. Our filtering and data augmentation steps directly influence these results.
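The automatic labeling step described in the abstract can be illustrated with a minimal sketch. It is not ProdSpot's implementation: the function names (`surface_forms`, `label_sentence`, `build_training_set`), the `B-PROD`/`I-PROD` tag names, and the brand-anchored n-gram heuristic for deriving surface forms are all assumptions made for illustration, and the cluster-based filtering and data augmentation steps are left out.

```python
import re


def tokenize(text):
    # Lowercased alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def surface_forms(titles, brands, max_len=3):
    """Collect candidate surface forms from offer titles.

    Assumed heuristic (not taken from the paper): whenever a single-token
    brand name occurs in a title, keep the n-grams that start at the brand,
    plus the same n-grams with the brand token removed.
    """
    brand_set = {b.lower() for b in brands}
    forms = set()
    for title in titles:
        toks = tokenize(title)
        for i, tok in enumerate(toks):
            if tok not in brand_set:
                continue
            for n in range(2, max_len + 1):
                if i + n <= len(toks):
                    forms.add(tuple(toks[i:i + n]))      # e.g. samsung galaxy s9
                    forms.add(tuple(toks[i + 1:i + n]))  # e.g. galaxy s9
    return forms


def label_sentence(sentence, forms, max_len=3):
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    toks = tokenize(sentence)
    tags = ["O"] * len(toks)
    i = 0
    while i < len(toks):
        for n in range(max_len, 0, -1):
            if i + n <= len(toks) and tuple(toks[i:i + n]) in forms:
                tags[i] = "B-PROD"
                for j in range(i + 1, i + n):
                    tags[j] = "I-PROD"
                i += n
                break
        else:
            # No surface form starts at position i.
            i += 1
    return list(zip(toks, tags))


def build_training_set(sentences, forms):
    """Keep only sentences that contain at least one surface form."""
    labeled = [label_sentence(s, forms) for s in sentences]
    return [s for s in labeled if any(tag != "O" for _, tag in s)]
```

For instance, calling `surface_forms(["Samsung Galaxy S9 64GB Black"], ["Samsung"])` and passing the result to `build_training_set` together with a list of forum sentences keeps only the sentences that mention one of the derived forms, each as a list of `(token, tag)` pairs. Exact dictionary matching of this kind tends to produce noisy labels, which is the motivation the abstract gives for the subsequent filtering step.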

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

http://www.howardforums.com/forums.php
The term bootstrap is used here to indicate the automatic creation of training data, rather than statistical sampling.
Part-of-speech tags were created using the Natural Language Toolkit (NLTK).
Our implementation uses scikit-learn (feature extraction) and NumPy.

Funding

This research was partially supported by the following grants. In Brazil: FAPEAM-POSGRAD 2020 (Resolution 002/2020); Coordination for the Improvement of Higher Education Personnel-Brazil (CAPES) Financial Code 001; Project MMBIAS (FAPESP MCTIC/CGI,2020/05173-4); Project SocSens (CAPES/PGCI, 88887.130299/2017-01); Scholarships from FAPEAM/RHTI and CAPES (99999.006956/2015-07) to Henry Vieira; Author's individual grants from CNPq. In Portugal: National funds through Fundação para a Ciência e a Tecnologia (FCT, UID/CEC/50021/2013), and project GoLocal (CMUPERI/TIC/0046/2014).

Author information

Authors and Affiliations

LuizaLabs, São Paulo, Brazil
Henry S. Vieira
Instituto de Computação, UFAM, Manaus, Brazil
Altigran S. da Silva & Edleno S. de Moura
INESC-ID, Instituto Superior Técnico, Lisbon, Portugal
Pável Calado

Corresponding author

Correspondence to Henry S. Vieira.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Vieira, H.S., Silva, A.S.d., Calado, P. et al. A distantly supervised approach for recognizing product mentions in user-generated content. J Intell Inf Syst 59, 543–566 (2022). https://doi.org/10.1007/s10844-022-00718-4

Received: 16 December 2021
Revised: 09 May 2022
Accepted: 10 May 2022
Published: 27 May 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10844-022-00718-4
