Abstract
Clustering-based sentiment analysis is a novel approach for analyzing opinions expressed in reviews, comments, or blogs. In contrast to the two traditional mainstream approaches (supervised learning and symbolic techniques), the clustering-based approach can produce reasonably accurate results without human participation, linguistic knowledge, or training time.
This paper introduces new techniques that extend the clustering-based sentiment analysis approach in two respects: first, by applying opposite-opinion content processing and non-opinion content processing to further improve accuracy; and second, by using a modified voting mechanism and distance measurement method to conduct fine-grained (three-class) sentiment analysis. The experimental results show that the clustering-based approach produces high-quality sentiment analysis results and is suitable for recognizing neutral opinions.
Notes
Once the TF-IDF weights have been calculated from term frequency data, the weight values can also be applied to term presence data.
In this paper, documents with large proportions of objective content are not regarded as neutral documents. The object of study is opinion-expressing documents, although these usually contain a small proportion of objective content.
Cite this article
Li, G., Liu, F. Sentiment analysis based on clustering: a framework in improving accuracy and recognizing neutral opinions. Appl Intell 40, 441–452 (2014). https://doi.org/10.1007/s10489-013-0463-3
DOI: https://doi.org/10.1007/s10489-013-0463-3