
Cross-Category Defect Discovery from Online Reviews: Supplementing Sentiment with Category-Specific Semantics

Published in Information Systems Frontiers.

Abstract

Online reviews contain many vital insights for quality management, but the sheer volume of content makes identifying defect-related discussion difficult. This paper critically assesses multiple approaches for detecting defect-related discussion, ranging from out-of-the-box sentiment analyses to supervised and unsupervised machine-learned defect terms. Using reviews from 25 product and service categories, we assess each method’s performance both across the broad cross-section of categories and when tailored to a single category of study. Surprisingly, we found that negative sentiment was often a poor predictor of defect-related discussion. Terms generated with unsupervised topic modeling tended to correspond to generic product discussion rather than defect-related discussion. Supervised learning techniques outperformed the other text analytic techniques in our cross-category analysis, and they were especially effective when confined to a single category of study. Our work suggests a need for category-specific text analyses to take full advantage of consumer-driven quality intelligence.


[Figs. 1–6 appear in the full article]


Notes

  1. The unexpectedly high proportion of defects in high-star reviews may, in part, be due to active moderation by online retailers of extremely negative reviews, which may reduce total defect reports, particularly in low-star reviews. For instance, Amazon does not post submitted reviews that “violate community guidelines”. Manufacturers typically cannot access moderated reviews submitted on a retailer’s website, since these have been suppressed; such reviews may have contained additional defect reports.

  2. Only the best three of the available scoring methods, as measured by AUC, are depicted in Fig. 2 to enhance readability. A chi-squared test of the proportion of defects in the top 200 reviews for the two best sentiment methods (AFINN versus SentiStrength) indicates that they did not significantly differ (p = 0.18); we chose AFINN as the best sentiment method due to its marginally better results. Similarly, smoke bigrams performed marginally better than smoke unigrams, but the difference was not significant via a chi-squared test (p = 0.14).

  3. As benchmarks, we also compared these techniques to more general machine learning techniques, namely neural networks, naïve Bayes, and support vector machines (SVM). We implemented neural networks in JMP Pro, which uses a penalized Gaussian (least squares) maximum likelihood function. We initially used a single hidden layer, and we later found that adding an additional hidden layer did not improve results. We implemented SVM using the scikit-learn Python library with its default settings: a penalty parameter of 1.0 and a radial basis function (RBF) kernel. We found that neural networks, naïve Bayes, and SVM yielded 156, 124, and 154 true positives (defects) in the top 200-ranked reviews of the holdout set, with AUC values of 0.58, 0.54, and 0.58, respectively. As such, these techniques did not outperform the other methods that we attempted. However, smoke terms are advantageous in that they are more easily interpretable and explainable, whereas these other methods may be “black boxes” for which it is difficult to articulate clear reasoning behind each prediction.

  4. For interpretability, all variables were scaled from 0 to 1, where 0 indicates lower defect likelihood and 1 indicates higher defect likelihood.

References

  • Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery from social media. Decision Support Systems, 54(1), 87–97.

  • Abrahams, A. S., Jiao, J., Fan, W., Wang, G. A., & Zhang, Z. (2013). What's buzzing in the blizzard of buzz? Automotive component isolation in social media postings. Decision Support Systems, 55(4), 871–882.

  • Abrahams, A. S., Fan, W., Wang, G. A., Zhang, Z. J., & Jiao, J. (2015). An integrated text analytic framework for product defect discovery. Production and Operations Management, 24(6), 975–990.

  • Adams, D. Z., Gruss, R., & Abrahams, A. S. (2017). Automated discovery of safety and efficacy concerns for joint & muscle pain relief treatments from online reviews. International Journal of Medical Informatics, 100, 108–120.

  • Baumeister, R. F., Bratslavsky, E., Finkenauer, C., & Vohs, K. D. (2001). Bad is stronger than good. Review of General Psychology, 5(4), 323–370.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  • Brahma, A., Goldberg, D. M., Zaman, N., & Aloiso, M. (2021). Automated mortgage origination delay detection from textual conversations. Decision Support Systems, 140, 113433.

  • Chen, Y., Ganesan, S., & Liu, Y. (2009). Does a firm's product-recall strategy affect its financial value? An examination of strategic alternatives during product-harm crises. Journal of Marketing, 73(6), 214–226.

  • Chong, A. Y. L., Khong, K. W., Ma, T., McCabe, S., & Wang, Y. (2018). Analyzing key influences of tourists’ acceptance of online reviews in travel decisions. Internet Research, 28, 564–586.

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

  • Cu, T., Schneider, H., & Van Scotter, J. (2017). How does sentiment content of product reviews make diffusion different? Journal of Computer Information Systems, 1–9.

  • Cui, G., Lui, H.-K., & Guo, X. (2012). The effect of online consumer reviews on new product sales. International Journal of Electronic Commerce, 17(1), 39–58.

  • Das, A. S., Mehta, S., & Subramaniam, L. V. (2017). AnnoFin–A hybrid algorithm to annotate financial text. Expert Systems with Applications, 88, 270–275.

  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Deming, W. E., & Edwards, D. W. (1982). Quality, productivity, and competitive position (Vol. 183). Cambridge, MA: Massachusetts Institute of Technology, Center for Advanced Engineering Study.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Duan, W., Gu, B., & Whinston, A. (2008). Do online reviews matter?—An empirical investigation of panel data. Decision Support Systems, 45(4), 1007–1016.

  • Eliashberg, J., Hui, S. K., & Zhang, Z. J. (2014). Assessing box office performance using movie scripts: A kernel-based approach. IEEE Transactions on Knowledge and Data Engineering, 26(11), 2639–2648.

  • Fan, W., & Gordon, M. D. (2014). The power of social media analytics. Communications of the ACM, 57(6), 74–81.

  • Fan, W., Gordon, M. D., & Pathak, P. (2005). Effective profiling of consumer information retrieval needs: A unified framework and empirical comparison. Decision Support Systems, 40(2), 213–233.

  • Fleiss, J. L., Levin, B., & Paik, M. C. (2013). Statistical methods for rates and proportions. Hoboken: Wiley.

  • Fornell, C., Johnson, M. D., Anderson, E. W., Cha, J., & Bryant, B. E. (1996). The American customer satisfaction index: Nature, purpose, and findings. The Journal of Marketing, 60, 7–18.

  • Ghiassi, M., Zimbra, D., & Lee, S. (2016). Targeted Twitter sentiment analysis for brands using supervised feature engineering and the dynamic architecture for artificial neural networks. Journal of Management Information Systems, 33(4), 1034–1058.

  • Goldberg, D. M., & Abrahams, A. S. (2018). A Tabu search heuristic for smoke term curation in safety defect discovery. Decision Support Systems, 105, 52–65.

  • Goldberg, D. M., Khan, S., Zaman, N., Gruss, R. J., & Abrahams, A. S. (2021). Text mining approaches for postmarket food safety surveillance using online media. Risk Analysis.

  • Gopal, R., Marsden, J. R., & Vanthienen, J. (2011). Information mining—Reflections on recent advancements and the road ahead in data, text, and media mining. Elsevier.

  • Guo, Y., Barnes, S. J., & Jia, Q. (2017). Mining meaning from online ratings and reviews: Tourist satisfaction analysis using latent Dirichlet allocation. Tourism Management, 59, 467–483.

  • He, W., Tian, X., Chen, Y., & Chong, D. (2016). Actionable social media competitive analytics for understanding customer experiences. Journal of Computer Information Systems, 56(2), 145–155.

  • Hendricks, K. B., & Singhal, V. R. (1997). Does implementing an effective TQM program actually improve operating performance? Empirical evidence from firms that have won quality awards. Management Science, 43(9), 1258–1274.

  • Hendricks, K. B., & Singhal, V. R. (2001). The long-run stock price performance of firms with effective TQM programs. Management Science, 47(3), 359–368.

  • Holton, C. (2009). Identifying disgruntled employee systems fraud risk through text mining: A simple solution for a multi-billion dollar problem. Decision Support Systems, 46(4), 853–864.

  • Hora, M., Bapuji, H., & Roth, A. V. (2011). Safety hazard and time to recall: The role of recall strategy, product defect type, and supply chain player in the US toy industry. Journal of Operations Management, 29(7–8), 766–777.

  • Hu, N., Pavlou, P. A., & Zhang, J. (2006). Can online reviews reveal a product's true quality? Empirical findings and analytical modeling of online word-of-mouth communication. Paper presented at the 7th ACM Conference on Electronic Commerce.

  • Hu, N., Liu, L., & Zhang, J. J. (2008). Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology & Management, 9(3), 201–214.

  • Hu, N., Pavlou, P. A., & Zhang, J. J. (2009). Why do online product reviews have a J-shaped distribution? Overcoming biases in online word-of-mouth communication. Communications of the ACM, 52(10), 144–147.

  • Hu, N., Bose, I., Koh, N. S., & Liu, L. (2012). Manipulation of online reviews: An analysis of ratings, readability, and sentiments. Decision Support Systems, 52(3), 674–684.

  • Hu, N., Koh, N. S., & Reddy, S. K. (2014). Ratings lead you to the product, reviews help you clinch it? The mediating role of online review sentiments on product sales. Decision Support Systems, 57, 42–53.

  • Jarrell, G., & Peltzman, S. (1985). The impact of product recalls on the wealth of sellers. Journal of Political Economy, 93(3), 512–536.

  • Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.

  • Jung, Y., & Suh, Y. (2019). Mining the voice of employees: A text mining approach to identifying and analyzing job satisfaction factors from online employee reviews. Decision Support Systems, 123, 113074.

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  • Lau, R. Y., Li, C., & Liao, S. S. (2014). Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decision Support Systems, 65, 80–94.

  • Law, D., Gruss, R., & Abrahams, A. S. (2017). Automated defect discovery for dishwasher appliances from online consumer reviews. Expert Systems with Applications, 67, 84–94.

  • Lee, J., Park, D.-H., & Han, I. (2008). The effect of negative online consumer reviews on product attitude: An information processing view. Electronic Commerce Research and Applications, 7(3), 341–352.

  • Lee, S., Song, J., & Kim, Y. (2010). An empirical comparison of four text mining methods. Journal of Computer Information Systems, 51(1), 1–10.

  • Liu, Y., Jiang, C., & Zhao, H. (2018). Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums. Decision Support Systems, 105, 1–12.

  • Lyles, M. A., Flynn, B. B., & Frohlich, M. T. (2008). All supply chains don't flow through: Understanding supply chain issues in product recalls. Management and Organization Review, 4(2), 167–182.

  • McAuley, J., Pandey, R., & Leskovec, J. (2015). Inferring networks of substitutable and complementary products. Paper presented at the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  • Moro, S., Cortez, P., & Rita, P. (2015). Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation. Expert Systems with Applications, 42(3), 1314–1324.

  • Mostafa, M. M. (2013). More than words: Social networks’ text mining for consumer brand sentiments. Expert Systems with Applications, 40(10), 4241–4251.

  • Mummalaneni, V., Gruss, R., Goldberg, D. M., Ehsani, J. P., & Abrahams, A. S. (2018). Social media analytics for quality surveillance and safety hazard detection in baby cribs. Safety Science, 104, 260–268.

  • Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. Paper presented at the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

  • Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Paper presented at the 1st Workshop on Making Sense of Microposts.

  • Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756–3763.

  • Park, C., & Lee, T. M. (2009). Information direction, website reputation and eWOM effect: A moderating role of product type. Journal of Business Research, 62(1), 61–67.

  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Paper presented at the Conference on Empirical Methods in Natural Language Processing.

  • Phillips, P., Zigan, K., Silva, M. M. S., & Schegg, R. (2015). The interactive effects of online reviews on the determinants of Swiss hotel performance: A neural network analysis. Tourism Management, 50, 130–141.

  • Porter, M. E., & Van der Linde, C. (1995). Toward a new conception of the environment-competitiveness relationship. Journal of Economic Perspectives, 9(4), 97–118.

  • Qi, J., Zhang, Z., Jeon, S., & Zhou, Y. (2016). Mining customer requirements from online reviews: A product improvement perspective. Information & Management, 53(8), 951–963.

  • Qiao, Z., Zhang, X., Zhou, M., Wang, G. A., & Fan, W. (2017). A domain oriented LDA model for mining product defects from online customer reviews. Paper presented at the 50th Hawaii International Conference on System Sciences.

  • Rhee, M., & Haunschild, P. R. (2006). The liability of good reputation: A study of product recalls in the US automobile industry. Organization Science, 17(1), 101–117.

  • Shi, D., Guan, J., Zurada, J., & Manikas, A. (2017). A data-mining approach to identification of risk factors in safety management systems. Journal of Management Information Systems, 34(4), 1054–1081.

  • Stern, H. (1962). The significance of impulse buying today. The Journal of Marketing, 26, 59–62.

  • Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558.

  • Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463–479.

  • Winkler, M., Abrahams, A. S., Gruss, R., & Ehsani, J. P. (2016). Toy safety surveillance from online reviews. Decision Support Systems, 90, 23–32.

  • Yu, Y., Duan, W., & Cao, Q. (2013). The impact of social and conventional media on firm equity value: A sentiment analysis approach. Decision Support Systems, 55(4), 919–926.

  • Zaman, N., Goldberg, D. M., Abrahams, A. S., & Essig, R. A. (2020). Facebook hospital reviews: Automated service quality detection and relationships with patient satisfaction. Decision Sciences.

  • Zhang, Z. (2008). Mining relational data from text: From strictly supervised to weakly supervised learning. Information Systems, 33(3), 300–314.

  • Zhao, W. X., Jiang, J., Yan, H., & Li, X. (2010). Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. Paper presented at the 2010 Conference on Empirical Methods in Natural Language Processing.


Acknowledgments

Alan S. Abrahams and Peter Ractham gratefully acknowledge support for this work from Thammasat University in the form of the Bualuang ASEAN Fellowship.

Author information


Corresponding author

Correspondence to David M. Goldberg.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

To curate smoke terms, we utilize the CC score algorithm (Fan et al., 2005), an information retrieval technique that prior work has shown to be quite effective for smoke term curation (Abrahams et al., 2012, 2013, 2015; Goldberg & Abrahams, 2018). In statistics, the chi-squared distribution is used to test for independence between two variables; information retrieval research has applied this principle to examine relationships between documents and the words they contain. Ng et al. (1997) first suggested a means of using a one-sided chi-squared test to select words or phrases associated with a relevant classification of documents; Fan et al. (2005) later expanded upon this technique. The CC score algorithm generates a relevance score for each term (word or phrase) in a corpus, where higher scores indicate more relevant terms that may be predictive of the target classification. We first distinguish relevant documents, i.e., those from the target classification (in our study, defect-related reviews), from non-relevant documents, i.e., those not from the target classification (in our study, non-defect-related reviews). Consider Table 8, which defines the relationships between document relevance and inclusion/exclusion of terms.

Table 8 Contingency table for CC score algorithm (adapted from Fan et al. (2005))

Given this contingency table, terms are given higher scores when they are especially frequent in documents that are relevant and especially infrequent in documents that are irrelevant. The CC score algorithm defines this relevance as follows for each term in the corpus:

$$ Relevance=\frac{\sqrt{N}\times \left( AD- CB\right)}{\sqrt{\left(A+B\right)\times \left(C+D\right)}} $$
(1)
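As a concrete illustration, Eq. (1) can be computed directly from the contingency counts. The sketch below is ours, not the authors' code, and it assumes a conventional orientation for Table 8 (which is not reproduced here): A and B count relevant and non-relevant documents containing the term, while C and D count relevant and non-relevant documents lacking it.

```python
import math

def cc_score(a, b, c, d):
    """Correlation coefficient (CC) relevance score for one term, per Eq. (1).

    a: relevant documents containing the term
    b: non-relevant documents containing the term
    c: relevant documents lacking the term
    d: non-relevant documents lacking the term
    """
    n = a + b + c + d  # total documents in the corpus
    denom = math.sqrt((a + b) * (c + d))
    if denom == 0:
        return 0.0  # degenerate case: term appears in every document or in none
    return math.sqrt(n) * (a * d - c * b) / denom
```

A term concentrated in relevant documents (large A, small B) drives the numerator positive, while a term equally common in both classes scores zero, matching the intuition that high-scoring terms are both frequent in relevant documents and infrequent in irrelevant ones.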

The CC score algorithm generates a relevance score for each term in the corpus such that terms with high relevance scores occur frequently in relevant documents and infrequently in irrelevant documents. Thus, we may use high-scoring terms as predictors of relevance in unseen documents. After using the CC score algorithm to generate relevance scores for each term, the lead author analyzed the top-ranking terms to remove any stop words (common English words like “a,” “an,” and “the”), common brand names, and/or common product (or service) categories (Abrahams et al., 2012, 2013, 2015). A coauthor further reviewed and reverified these decisions to ensure accuracy. The retained terms are referred to as smoke terms, and each set of smoke terms is referred to as a smoke term list.

When analyzing unseen reviews (e.g., our holdout set), we use the appropriate smoke term list to generate “smoke scores” for each review. For a given review, we determine this smoke score by searching for any occurrences of the smoke terms within that review. Each time we observe an occurrence of a smoke term, we increment that review’s smoke score by that smoke term’s relevance score as determined by the CC score algorithm. Finally, using these smoke scores, we can prioritize the reviews believed to refer to defects. We can sort the reviews from the highest smoke score to the lowest smoke score, where the highest smoke scores are the most likely to refer to defects.
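The scoring-and-ranking step described above can be sketched as follows. This is a minimal illustration with function names and term weights of our own choosing; it uses simple substring counting for brevity, whereas a production system would tokenize and respect word boundaries.

```python
def smoke_score(review, smoke_terms):
    """Sum the relevance of every smoke term occurrence in a review.

    smoke_terms maps each smoke term (word or phrase) to its CC relevance
    score; each occurrence increments the review's score by that relevance.
    """
    text = review.lower()
    return sum(text.count(term) * relevance
               for term, relevance in smoke_terms.items())

def rank_reviews(reviews, smoke_terms):
    """Order reviews from highest to lowest smoke score for prioritization."""
    return sorted(reviews, key=lambda r: smoke_score(r, smoke_terms), reverse=True)

# Hypothetical smoke term list: term -> CC relevance score
terms = {"broke": 3.0, "stopped working": 5.0}
reviews = ["It broke and stopped working.", "Great product!", "It broke again."]
ranked = rank_reviews(reviews, terms)  # most defect-like review first
```

Sorting by smoke score surfaces the reviews most likely to describe defects, so an analyst can triage the top of the list rather than read the entire corpus.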


About this article


Cite this article

Zaman, N., Goldberg, D.M., Gruss, R.J. et al. Cross-Category Defect Discovery from Online Reviews: Supplementing Sentiment with Category-Specific Semantics. Inf Syst Front 24, 1265–1285 (2022). https://doi.org/10.1007/s10796-021-10122-y

