
Cross-Category Defect Discovery from Online Reviews: Supplementing Sentiment with Category-Specific Semantics

Published in Information Systems Frontiers.

Abstract

Online reviews contain many vital insights for quality management, but the sheer volume of content makes identifying defect-related discussion difficult. This paper critically assesses multiple approaches for detecting defect-related discussion, ranging from out-of-the-box sentiment analyses to supervised and unsupervised machine-learned defect terms. Using reviews from 25 product and service categories, we assess each method’s performance both across the broad cross-section of categories and when tailored to a single category of study. Surprisingly, we found that negative sentiment was often a poor predictor of defect-related discussion. Terms generated with unsupervised topic modeling tended to correspond to generic product discussion rather than defect-related discussion. Supervised learning techniques outperformed the other text analytic techniques in our cross-category analysis, and they were especially effective when confined to a single category of study. Our work suggests a need for category-specific text analyses to take full advantage of consumer-driven quality intelligence.


[Figs. 1–6 appear in the full article]


Notes

  1. The unexpectedly high proportion of defects in high-star reviews may, in part, be due to active moderation by online retailers of extremely negative reviews, which may reduce total defect reports, particularly in low-star reviews. For instance, Amazon does not post submitted reviews that “violate community guidelines”. Manufacturers typically cannot access moderated reviews submitted on a retailer’s website, since these have been suppressed; such reviews may have contained additional defect reports.

  2. Only the best three of the available scoring methods, as measured by AUC, are depicted in Fig. 2 to enhance readability. A chi-squared test of the proportion of defects in the top 200 reviews for the two best sentiment methods (AFINN versus SentiStrength) indicates that they did not significantly differ (p = 0.18); we chose AFINN as the best sentiment method due to its marginally better results. Similarly, smoke bigrams performed marginally better than smoke unigrams, but the difference was not significant via a chi-squared test (p = 0.14).

  3. As benchmarks, we also compared these techniques to more general machine learning techniques, namely neural networks, naïve Bayes, and support vector machines (SVM). We implemented neural networks in JMP Pro, which uses a penalized Gaussian (least squares) maximum likelihood function. We initially used a single hidden layer, and we later found that adding an additional hidden layer did not improve results. We implemented SVM using the scikit-learn Python library with its default settings: a penalty parameter of 1.0 and a radial basis function (RBF) kernel. We found that neural networks, naïve Bayes, and SVM yielded 156, 124, and 154 true positives (defects) in the top 200-ranked reviews of the holdout set, with AUC values of 0.58, 0.54, and 0.58, respectively. As such, these techniques did not outperform the other methods that we attempted. However, smoke terms are advantageous in that they are more easily interpretable and explainable, whereas these other methods may be “black boxes” for which it is difficult to articulate clear reasoning behind each prediction.

  4. For interpretability, all variables were scaled from 0 to 1, where 0 indicates lower defect likelihood and 1 indicates higher defect likelihood.

References

  • Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery from social media. Decision Support Systems, 54(1), 87–97.

  • Abrahams, A. S., Jiao, J., Fan, W., Wang, G. A., & Zhang, Z. (2013). What's buzzing in the blizzard of buzz? Automotive component isolation in social media postings. Decision Support Systems, 55(4), 871–882.

  • Abrahams, A. S., Fan, W., Wang, G. A., Zhang, Z. J., & Jiao, J. (2015). An integrated text analytic framework for product defect discovery. Production and Operations Management, 24(6), 975–990.

  • Adams, D. Z., Gruss, R., & Abrahams, A. S. (2017). Automated discovery of safety and efficacy concerns for joint & muscle pain relief treatments from online reviews. International Journal of Medical Informatics, 100, 108–120.

  • Baumeister, R. F., Bratslavsky, E., Finkenauer, C., & Vohs, K. D. (2001). Bad is stronger than good. Review of General Psychology, 5(4), 323–370.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  • Brahma, A., Goldberg, D. M., Zaman, N., & Aloiso, M. (2021). Automated mortgage origination delay detection from textual conversations. Decision Support Systems, 140, 113433.

  • Chen, Y., Ganesan, S., & Liu, Y. (2009). Does a firm's product-recall strategy affect its financial value? An examination of strategic alternatives during product-harm crises. Journal of Marketing, 73(6), 214–226.

  • Chong, A. Y. L., Khong, K. W., Ma, T., McCabe, S., & Wang, Y. (2018). Analyzing key influences of tourists’ acceptance of online reviews in travel decisions. Internet Research, 28, 564–586.

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

  • Cu, T., Schneider, H., & Van Scotter, J. (2017). How does sentiment content of product reviews make diffusion different? Journal of Computer Information Systems, 1–9.

  • Cui, G., Lui, H.-K., & Guo, X. (2012). The effect of online consumer reviews on new product sales. International Journal of Electronic Commerce, 17(1), 39–58.

  • Das, A. S., Mehta, S., & Subramaniam, L. V. (2017). AnnoFin–A hybrid algorithm to annotate financial text. Expert Systems with Applications, 88, 270–275.

  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Deming, W. E., & Edwards, D. W. (1982). Quality, productivity, and competitive position (Vol. 183). Cambridge, MA: Massachusetts Institute of Technology, Center for Advanced Engineering Study.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Duan, W., Gu, B., & Whinston, A. (2008). Do online reviews matter?—An empirical investigation of panel data. Decision Support Systems, 45(4), 1007–1016.

  • Eliashberg, J., Hui, S. K., & Zhang, Z. J. (2014). Assessing box office performance using movie scripts: A kernel-based approach. IEEE Transactions on Knowledge and Data Engineering, 26(11), 2639–2648.

  • Fan, W., & Gordon, M. D. (2014). The power of social media analytics. Communications of the ACM, 57(6), 74–81.

  • Fan, W., Gordon, M. D., & Pathak, P. (2005). Effective profiling of consumer information retrieval needs: A unified framework and empirical comparison. Decision Support Systems, 40(2), 213–233.

  • Fleiss, J. L., Levin, B., & Paik, M. C. (2013). Statistical methods for rates and proportions. Hoboken: Wiley.

  • Fornell, C., Johnson, M. D., Anderson, E. W., Cha, J., & Bryant, B. E. (1996). The American customer satisfaction index: Nature, purpose, and findings. The Journal of Marketing, 60, 7–18.

  • Ghiassi, M., Zimbra, D., & Lee, S. (2016). Targeted Twitter sentiment analysis for brands using supervised feature engineering and the dynamic architecture for artificial neural networks. Journal of Management Information Systems, 33(4), 1034–1058.

  • Goldberg, D. M., & Abrahams, A. S. (2018). A Tabu search heuristic for smoke term curation in safety defect discovery. Decision Support Systems, 105, 52–65.

  • Goldberg, D. M., Khan, S., Zaman, N., Gruss, R. J., & Abrahams, A. S. (2021). Text mining approaches for postmarket food safety surveillance using online media. Risk Analysis.

  • Gopal, R., Marsden, J. R., & Vanthienen, J. (2011). Information mining—Reflections on recent advancements and the road ahead in data, text, and media mining. Elsevier.

  • Guo, Y., Barnes, S. J., & Jia, Q. (2017). Mining meaning from online ratings and reviews: Tourist satisfaction analysis using latent Dirichlet allocation. Tourism Management, 59, 467–483.

  • He, W., Tian, X., Chen, Y., & Chong, D. (2016). Actionable social media competitive analytics for understanding customer experiences. Journal of Computer Information Systems, 56(2), 145–155.

  • Hendricks, K. B., & Singhal, V. R. (1997). Does implementing an effective TQM program actually improve operating performance? Empirical evidence from firms that have won quality awards. Management Science, 43(9), 1258–1274.

  • Hendricks, K. B., & Singhal, V. R. (2001). The long-run stock price performance of firms with effective TQM programs. Management Science, 47(3), 359–368.

  • Holton, C. (2009). Identifying disgruntled employee systems fraud risk through text mining: A simple solution for a multi-billion dollar problem. Decision Support Systems, 46(4), 853–864.

  • Hora, M., Bapuji, H., & Roth, A. V. (2011). Safety hazard and time to recall: The role of recall strategy, product defect type, and supply chain player in the US toy industry. Journal of Operations Management, 29(7–8), 766–777.

  • Hu, N., Pavlou, P. A., & Zhang, J. (2006). Can online reviews reveal a product's true quality? Empirical findings and analytical modeling of online word-of-mouth communication. Paper presented at the 7th ACM Conference on Electronic Commerce.

  • Hu, N., Liu, L., & Zhang, J. J. (2008). Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology & Management, 9(3), 201–214.

  • Hu, N., Pavlou, P. A., & Zhang, J. J. (2009). Why do online product reviews have a J-shaped distribution? Overcoming biases in online word-of-mouth communication. Communications of the ACM, 52(10), 144–147.

  • Hu, N., Bose, I., Koh, N. S., & Liu, L. (2012). Manipulation of online reviews: An analysis of ratings, readability, and sentiments. Decision Support Systems, 52(3), 674–684.

  • Hu, N., Koh, N. S., & Reddy, S. K. (2014). Ratings lead you to the product, reviews help you clinch it? The mediating role of online review sentiments on product sales. Decision Support Systems, 57, 42–53.

  • Jarrell, G., & Peltzman, S. (1985). The impact of product recalls on the wealth of sellers. Journal of Political Economy, 93(3), 512–536.

  • Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.

  • Jung, Y., & Suh, Y. (2019). Mining the voice of employees: A text mining approach to identifying and analyzing job satisfaction factors from online employee reviews. Decision Support Systems, 123, 113074.

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  • Lau, R. Y., Li, C., & Liao, S. S. (2014). Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decision Support Systems, 65, 80–94.

  • Law, D., Gruss, R., & Abrahams, A. S. (2017). Automated defect discovery for dishwasher appliances from online consumer reviews. Expert Systems with Applications, 67, 84–94.

  • Lee, J., Park, D.-H., & Han, I. (2008). The effect of negative online consumer reviews on product attitude: An information processing view. Electronic Commerce Research and Applications, 7(3), 341–352.

  • Lee, S., Song, J., & Kim, Y. (2010). An empirical comparison of four text mining methods. Journal of Computer Information Systems, 51(1), 1–10.

  • Liu, Y., Jiang, C., & Zhao, H. (2018). Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums. Decision Support Systems, 105, 1–12.

  • Lyles, M. A., Flynn, B. B., & Frohlich, M. T. (2008). All supply chains don't flow through: Understanding supply chain issues in product recalls. Management and Organization Review, 4(2), 167–182.

  • McAuley, J., Pandey, R., & Leskovec, J. (2015). Inferring networks of substitutable and complementary products. Paper presented at the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  • Moro, S., Cortez, P., & Rita, P. (2015). Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation. Expert Systems with Applications, 42(3), 1314–1324.

  • Mostafa, M. M. (2013). More than words: Social networks’ text mining for consumer brand sentiments. Expert Systems with Applications, 40(10), 4241–4251.

  • Mummalaneni, V., Gruss, R., Goldberg, D. M., Ehsani, J. P., & Abrahams, A. S. (2018). Social media analytics for quality surveillance and safety hazard detection in baby cribs. Safety Science, 104, 260–268.

  • Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. Paper presented at the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

  • Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Paper presented at the 1st Workshop on Making Sense of Microposts.

  • Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756–3763.

  • Park, C., & Lee, T. M. (2009). Information direction, website reputation and eWOM effect: A moderating role of product type. Journal of Business Research, 62(1), 61–67.

  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Paper presented at the Conference on Empirical Methods in Natural Language Processing.

  • Phillips, P., Zigan, K., Silva, M. M. S., & Schegg, R. (2015). The interactive effects of online reviews on the determinants of Swiss hotel performance: A neural network analysis. Tourism Management, 50, 130–141.

  • Porter, M. E., & Van der Linde, C. (1995). Toward a new conception of the environment-competitiveness relationship. Journal of Economic Perspectives, 9(4), 97–118.

  • Qi, J., Zhang, Z., Jeon, S., & Zhou, Y. (2016). Mining customer requirements from online reviews: A product improvement perspective. Information & Management, 53(8), 951–963.

  • Qiao, Z., Zhang, X., Zhou, M., Wang, G. A., & Fan, W. (2017). A domain oriented LDA model for mining product defects from online customer reviews. Paper presented at the 50th Hawaii International Conference on System Sciences.

  • Rhee, M., & Haunschild, P. R. (2006). The liability of good reputation: A study of product recalls in the US automobile industry. Organization Science, 17(1), 101–117.

  • Shi, D., Guan, J., Zurada, J., & Manikas, A. (2017). A data-mining approach to identification of risk factors in safety management systems. Journal of Management Information Systems, 34(4), 1054–1081.

  • Stern, H. (1962). The significance of impulse buying today. The Journal of Marketing, 26, 59–62.

  • Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558.

  • Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463–479.

  • Winkler, M., Abrahams, A. S., Gruss, R., & Ehsani, J. P. (2016). Toy safety surveillance from online reviews. Decision Support Systems, 90, 23–32.

  • Yu, Y., Duan, W., & Cao, Q. (2013). The impact of social and conventional media on firm equity value: A sentiment analysis approach. Decision Support Systems, 55(4), 919–926.

  • Zaman, N., Goldberg, D. M., Abrahams, A. S., & Essig, R. A. (2020). Facebook hospital reviews: Automated service quality detection and relationships with patient satisfaction. Decision Sciences.

  • Zhang, Z. (2008). Mining relational data from text: From strictly supervised to weakly supervised learning. Information Systems, 33(3), 300–314.

  • Zhao, W. X., Jiang, J., Yan, H., & Li, X. (2010). Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. Paper presented at the 2010 Conference on Empirical Methods in Natural Language Processing.


Acknowledgments

Alan S. Abrahams and Peter Ractham gratefully acknowledge support for this work from Thammasat University in the form of the Bualuang ASEAN Fellowship.

Author information


Corresponding author

Correspondence to David M. Goldberg.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

To curate smoke terms, we utilize the CC score algorithm (Fan et al., 2005), an information retrieval technique that prior work has shown to be quite effective for smoke term curation (Abrahams et al., 2012, 2013, 2015; Goldberg & Abrahams, 2018). In statistics, the chi-squared distribution is used to test for independence between two variables; information retrieval research has applied this principle to examine relationships between documents and the words they contain. Ng et al. (1997) first suggested a means of using a one-sided chi-squared test to select words or phrases associated with a relevant classification of documents; Fan et al. (2005) later expanded upon this technique. The CC score algorithm generates a relevance score for each term (word or phrase) in a corpus, where higher scores indicate more relevant terms that may be predictive of the target classification. We first distinguish relevant documents, i.e., those from the target classification (in our study, defect-related reviews), from non-relevant documents, i.e., those not from the target classification (in our study, non-defect-related reviews). Consider Table 8, which defines the relationships between document relevance and inclusion/exclusion of terms.

Table 8 Contingency table for CC score algorithm (adapted from Fan et al. (2005))

Given this contingency table, terms are given higher scores when they are especially frequent in documents that are relevant and especially infrequent in documents that are irrelevant. The CC score algorithm defines this relevance as follows for each term in the corpus:

$$ Relevance=\frac{\sqrt{N}\times \left( AD- CB\right)}{\sqrt{\left(A+B\right)\times \left(C+D\right)}} $$
(1)
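As a concrete illustration, Eq. (1) can be computed directly from the contingency counts. The sketch below is ours, not the authors' code, and it assumes a conventional orientation for Table 8 (which is not reproduced here): A and B count relevant and non-relevant documents containing the term, while C and D count relevant and non-relevant documents lacking it.

```python
import math

def cc_score(a, b, c, d):
    """Correlation coefficient (CC) relevance score for one term, per Eq. (1).

    a: relevant documents containing the term
    b: non-relevant documents containing the term
    c: relevant documents lacking the term
    d: non-relevant documents lacking the term
    """
    n = a + b + c + d  # total documents in the corpus
    denom = math.sqrt((a + b) * (c + d))
    if denom == 0:
        return 0.0  # degenerate case: term appears in every document or in none
    return math.sqrt(n) * (a * d - c * b) / denom
```

A term concentrated in relevant documents (large A, small B) drives the numerator positive, while a term equally common in both classes scores zero, matching the intuition that high-scoring terms are both frequent in relevant documents and infrequent in irrelevant ones.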

The CC score algorithm generates a relevance score for each term in the corpus such that terms with high relevance scores occur frequently in relevant documents and infrequently in irrelevant documents. Thus, we may use high-scoring terms as predictors of relevance in unseen documents. After using the CC score algorithm to generate relevance scores for each term, the lead author analyzed the top-ranking terms to remove any stop words (common English words like “a,” “an,” and “the”), common brand names, and/or common product (or service) categories (Abrahams et al., 2012, 2013, 2015). A coauthor further reviewed and reverified these decisions to ensure accuracy. The retained terms are referred to as smoke terms, and each set of smoke terms is referred to as a smoke term list.

When analyzing unseen reviews (e.g., our holdout set), we use the appropriate smoke term list to generate “smoke scores” for each review. For a given review, we determine this smoke score by searching for any occurrences of the smoke terms within that review. Each time we observe an occurrence of a smoke term, we increment that review’s smoke score by that smoke term’s relevance score as determined by the CC score algorithm. Finally, using these smoke scores, we can prioritize the reviews believed to refer to defects. We can sort the reviews from the highest smoke score to the lowest smoke score, where the highest smoke scores are the most likely to refer to defects.
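The scoring-and-ranking step described above can be sketched as follows. This is a minimal illustration with function names and term weights of our own choosing; it uses simple substring counting for brevity, whereas a production system would tokenize and respect word boundaries.

```python
def smoke_score(review, smoke_terms):
    """Sum the relevance of every smoke term occurrence in a review.

    smoke_terms maps each smoke term (word or phrase) to its CC relevance
    score; each occurrence increments the review's score by that relevance.
    """
    text = review.lower()
    return sum(text.count(term) * relevance
               for term, relevance in smoke_terms.items())

def rank_reviews(reviews, smoke_terms):
    """Order reviews from highest to lowest smoke score for prioritization."""
    return sorted(reviews, key=lambda r: smoke_score(r, smoke_terms), reverse=True)

# Hypothetical smoke term list: term -> CC relevance score
terms = {"broke": 3.0, "stopped working": 5.0}
reviews = ["It broke and stopped working.", "Great product!", "It broke again."]
ranked = rank_reviews(reviews, terms)  # most defect-like review first
```

Sorting by smoke score surfaces the reviews most likely to describe defects, so an analyst can triage the top of the list rather than read the entire corpus.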


About this article


Cite this article

Zaman, N., Goldberg, D.M., Gruss, R.J. et al. Cross-Category Defect Discovery from Online Reviews: Supplementing Sentiment with Category-Specific Semantics. Inf Syst Front 24, 1265–1285 (2022). https://doi.org/10.1007/s10796-021-10122-y

