A distantly supervised approach for recognizing product mentions in user-generated content

Journal of Intelligent Information Systems

Abstract

As online purchasing becomes more popular, users place more trust in information published on social media than in advertising content. Opinion mining is often applied to social media, and opinion target extraction is one of its main sub-tasks. In this paper, we focus on recognizing target entities related to electronic products. We propose ProdSpot, a method based on the distant supervision paradigm for training a named entity extractor to identify product mentions in user text. ProdSpot relies only on an unlabeled set of product offer titles and a list of product brand names. First, surface forms are identified from the product titles. Given a collection of user posts, our method selects the sentences that contain at least one surface form and labels them automatically. A cluster-based filtering strategy is then applied to detect and remove sentences that are likely mislabeled. Finally, data augmentation is used to produce more general and diverse training data. The augmented sentences constitute the training set for the recognition model. Experiments demonstrate that the automatically generated training data yields results similar to those achieved by a supervised model. Our best precision is only 9% lower than that of a supervised model, while our recall is higher by approximately 7% in two distinct product categories. Compared to a state-of-the-art supervised method specifically designed to recognize mobile phone names, our method achieves competitive results, with F1 values only 4% lower, without requiring user supervision. Our filtering and data augmentation steps directly influence these results.
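
The automatic labeling step described in the abstract can be illustrated with a minimal sketch. It is not the ProdSpot implementation: the whitespace tokenizer, the BIO tag names (`B-PROD`, `I-PROD`, `O`), the longest-match-first rule, and the function names are all assumptions made for illustration. The cluster-based filtering and data augmentation steps are not shown.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs for one sentence


def tokenize(text: str) -> List[str]:
    # Deliberately naive (assumption): lowercase and split on whitespace.
    return text.lower().split()


def label_sentence(tokens: List[str], forms: List[List[str]]) -> Tagged:
    """Tag each occurrence of a surface form as B-PROD/I-PROD, the rest as O."""
    tags = ["O"] * len(tokens)
    # Longer forms first, so "galaxy s3" takes precedence over "galaxy".
    for form in sorted(forms, key=len, reverse=True):
        n = len(form)
        for i in range(len(tokens) - n + 1):
            # Only label spans that no earlier (longer) form has claimed.
            span_is_free = all(t == "O" for t in tags[i:i + n])
            if span_is_free and tokens[i:i + n] == form:
                tags[i] = "B-PROD"
                for j in range(i + 1, i + n):
                    tags[j] = "I-PROD"
    return list(zip(tokens, tags))


def build_training_set(posts: List[str], surface_forms: List[str]) -> List[Tagged]:
    """Keep only sentences containing at least one surface form, labeled."""
    # Drop empty surface forms, which would otherwise match everywhere.
    forms = [f for f in (tokenize(s) for s in surface_forms) if f]
    labeled = (label_sentence(tokenize(p), forms) for p in posts)
    return [s for s in labeled if any(tag != "O" for _, tag in s)]


posts = [
    "I love my galaxy s3",
    "the battery died again",
    "Galaxy is a brand line",
]
training = build_training_set(posts, ["galaxy s3", "galaxy"])
```

With these toy inputs, the second post is discarded because it contains no surface form, and the other two are kept with their mentions tagged. Exact string matching of this kind is what makes the labels noisy (the third post arguably mentions a product line rather than a product), which is the motivation for a filtering step after labeling.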

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. http://www.howardforums.com/forums.php

  2. The term bootstrap is used here to indicate the automatic creation of training data, rather than statistical sampling.

  3. Part-of-speech tags were created using the Natural Language Toolkit (NLTK).

  4. Our implementation uses scikit-learn (feature extraction) and NumPy.

Funding

This research was partially supported by the following grants. In Brazil: FAPEAM-POSGRAD 2020 (Resolution 002/2020); Coordination for the Improvement of Higher Education Personnel-Brazil (CAPES) Financial Code 001; Project MMBIAS (FAPESP MCTIC/CGI, 2020/05173-4); Project SocSens (CAPES/PGCI, 88887.130299/2017-01); scholarships from FAPEAM/RHTI and CAPES (99999.006956/2015-07) to Henry Vieira; and the authors' individual grants from CNPq. In Portugal: national funds through Fundação para a Ciência e a Tecnologia (FCT, UID/CEC/50021/2013), and project GoLocal (CMUPERI/TIC/0046/2014).

Author information

Corresponding author

Correspondence to Henry S. Vieira.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Vieira, H.S., Silva, A.S.d., Calado, P. et al. A distantly supervised approach for recognizing product mentions in user-generated content. J Intell Inf Syst 59, 543–566 (2022). https://doi.org/10.1007/s10844-022-00718-4

  • DOI: https://doi.org/10.1007/s10844-022-00718-4
