Abstract
Sentiment analysis, also known as opinion mining, is more difficult to apply in Chinese than in Indo-European languages, owing to the compounding nature of Chinese words and phrases and the relative lack of reliable Chinese resources. This study used Chinese morphemes as seed words to classify movie reviews found on Yahoo! Taiwan; morphemes are monosyllabic characters that can function as individual words or be combined to create Chinese words and phrases. We used collocations with higher Pointwise Mutual Information (PMI), which consist of selected morpheme-level compounded features, to build classifiers. The contributions of this study are as follows: (1) proposing a method of generating domain-dependent Chinese morphemes directly from a large data set without any predefined sentiment resources; (2) building morpheme-based classifiers that are applicable to various movie genres and are shown to produce better results than other classifiers based on keywords (NTUSD and HowNet) or feature selection (TFIDF); (3) identifying compounds that have different semantic polarities depending on context.
Notes
A part-of-speech tagger (P.O.S tagger) is software that reads text and assigns each word (or other token) a part of speech, such as noun, verb, or adjective. The part-of-speech tools from SINICA CKIP are available at http://ckipsvr.iis.sinica.edu.tw/
The collected Yahoo!Movies Taiwan corpus, with P.O.S tags from CKIP, is available at https://github.com/fychao/ChineseMovieReviews
Simplified/traditional Chinese conversion tables include parallel translations of common words and phrases in Taiwan, China, Hong Kong, and Singapore, and can be retrieved from the following link: http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/ZhConversion.php
Scikit-learn v0.12 http://scikit-learn.org/stable/
Natural Language Toolkit 2.0 https://github.com/nltk
Ten-fold cross-validation is a process that splits the training dataset into 10 subsets of equal size, and then uses each subset in turn for testing while the remaining nine are used for training. The validation process therefore involves 10 iterations of training and testing.
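The splitting procedure described in this note can be sketched in a few lines of plain Python. This is an illustration only, not the authors' code: the function name `ten_fold_splits` and the sample count of 40 are hypothetical, and the study itself reports using scikit-learn rather than a hand-written splitter.

```python
def ten_fold_splits(n_samples, n_folds=10):
    """Yield (train_indices, test_indices) for each of the n_folds iterations.

    Hypothetical helper for illustration; folds are contiguous and not shuffled.
    """
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds take one extra sample when the
        # data set does not divide evenly.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


# Toy usage: 40 samples give 10 iterations of 36 training / 4 testing samples.
splits = list(ten_fold_splits(40))
```

Each sample appears in exactly one test subset, so every opinion is used for testing once and for training nine times.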
References
Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69–72). Sydney, Australia.
Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational linguistics, 16(1), 22–29.
Das, S., & Chen, M. (2001). Yahoo! for Amazon: extracting market sentiment from stock message boards. Management Science, 53(9), 1375–1388.
Dong, Z., & Dong, Q. (2006). HowNet and the Computation of Meaning. World Scientific.
Esuli, A., & Sebastiani, F. (2006). Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC (Vol. 6, pp.417–422). Genoa, Italy.
Feng, S., Wang, L., Xu, W., Wang, D., & Yu, G. (2012). Unsupervised learning Chinese sentiment lexicon from massive microblog data. Advanced Data Mining and Applications, 7713, 27–38.
Ku, L. W., Liang, Y. T., & Chen, H. H. (2006). Opinion extraction, summarization and tracking in news and blog corpora. Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, AAAI Technical Report, 100–107. CA, USA.
Ku, L. W., Liu, I. C., Lee, C. Y., Chen, K. H., & Chen, H. H. (2008). Sentence-level opinion analysis by COPEOPI in NTCIR-7. In Proceeding of NTCIR-7 Workshop (pp. 260–267). Tokyo, Japan.
Ku, L. W., Huang, T. H., & Chen, H. H. (2009). Using morphological and syntactic structures for Chinese opinion analysis. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Vol. 3, no.3, pp. 1260–1269). Singapore.
Li, N., & Wu, D. D. (2010). Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2), 354–368.
Li, L., & Yao, T. (2007, August). Kernel-based sentiment classification for Chinese sentence. In Advanced Language Processing and Web Information Technology, ALPIT 2007. Sixth International Conference (pp. 27–32). Henan, China.
Li, D., Ma, Y. T., & Guo, J. L. (2009). Words semantic orientation classification based on HowNet. The Journal of China Universities of Posts and Telecommunications, 16(1), 106–110.
Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of natural language processing, 2nd edition.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd international conference on Knowledge capture (pp. 70–77). NY, USA.
Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. Annual Meeting-Association for computational linguistics, 43(1). Jeju, Korea.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1–2), 1–135.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Duchesnay, É., et al. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Sun, Y. T., Chen, C. L., Liu, C. C., Liu, C. L., & Soo, V. W. (2010). Sentiment classification of short Chinese sentences. Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010) (pp. 184–198). San Jose de Buan, Philippines.
Tan, S., & Zhang, J. (2008). An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4), 2622–2629.
Turney, P. D. (2001, September). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the 12th European Conference on Machine Learning (pp. 491–502).
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics (pp. 417–424). Freiburg, Germany.
Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London: Butterworth.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Wan, X. J. (2009). Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (Vol. 1, pp. 235–243). Singapore.
Wang, X., Zhao, Y. Q., & Fu, G. H. (2011). A Morpheme-based Method to Chinese Sentence-Level Sentiment Classification. International Journal of Asian Language Processing, 21(3), 95–106. Penang, Malaysia.
Wu, Z., & Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9), 532–542.
Wu, Z., & Tseng, G. (1999). ACTS: an automatic Chinese text segmentation system for full text retrieval. Journal of the American Society for Information Science, 46(2), 83–96.
Wu, Y., & Wen, M. (2010, August). Disambiguating dynamic sentiment ambiguous adjectives. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010) (pp. 1191–1199). Beijing, China.
Xu, H., Zhao, K., Qiu, L., & Hu, C. (2011). Expanding Chinese sentiment dictionaries from large scale unlabeled corpus. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, 3, 53–57. Sendai, Japan.
Ye, Q., Shi, W., & Li, Y. (2006). Sentiment classification for movie reviews in Chinese by improved semantic oriented approach. Proceedings of the 39th Hawaii International Conference on System Sciences, HICSS’06, 3. Hawaii, USA.
Yuen, R. W., Chan, T. Y., Lai, T. B., Kwong, O. Y., & T’sou, B. K. (2004). Morpheme-based derivation of bipolar semantic orientation of Chinese words. In Proceedings of the 20th international conference on Computational Linguistics (pp. 1008–1014). PA, USA.
Zhang, W. H., Hua, X., & Wei, W. (2012). Weakness Finder: find product weakness from Chinese reviews by using aspects based sentiment analysis. Expert Systems with Applications, 39(11), 10283–10291.
Zhou, X., Marslen-Wilson, W., Taft, M., & Shu, H. (1999). Morphology, orthography, and phonology reading Chinese compound words. Language and cognitive processes, 14(5–6), 525–565.
Acknowledgments
The authors would like to thank the National Science Council, Taiwan, for financially supporting this research under Contract No. NSC 101-2410-H-004-015-MY3.
Appendixes
1.1 Appendix I movie genres
A total of 127,424 opinions with five-star rankings, comprising 4,631,482 words in 18 movie genres, were collected from Yahoo!Movies Taiwan. Note that one opinion may belong to more than one movie genre at the same time, so the per-genre counts below sum to more than the number of opinions.
Genres | Rank1 | Rank2 | Rank3 | Rank4 | Rank5 | Category |
A = 奇幻 Fantasy | 1,870 | 700 | 979 | 1,433 | 4,137 | Group 1 |
B = 科幻 Science Fiction | 2,151 | 857 | 1,403 | 2,323 | 9,657 | |
C = 犯罪 Crime | 1,116 | 326 | 535 | 934 | 2,769 | Group 2 |
F = 動作 Action | 7,741 | 2,822 | 4,535 | 7,550 | 27,690 | |
D = 劇情 Drama | 8,659 | 2,799 | 4,334 | 7,577 | 33,492 | Group 3 |
E = 溫馨/家庭 Romance/Family | 315 | 134 | 250 | 481 | 2,798 | |
H = 愛情 Love Story | 3,628 | 1,078 | 1,599 | 2,833 | 13,455 | |
P = 動畫 Animation | 405 | 193 | 367 | 779 | 4,534 | Group 6 |
Q = 喜劇 Comedy | 2,097 | 855 | 1,323 | 2,463 | 8,250 | |
I = 冒險 Adventure | 4,023 | 1,301 | 1,893 | 2,818 | 8,414 | |
K = 恐怖 Terror | 3,261 | 712 | 1,019 | 1,454 | 2,734 | Group 5 |
R = 懸疑/驚悚 Mystery/Thriller | 6,343 | 1,881 | 2,825 | 4,456 | 11,056 | |
G = 勵志 Inspiring | 148 | 62 | 95 | 217 | 1,654 | Group 4 |
J = 歷史/傳記 History/Biography | 623 | 203 | 316 | 458 | 1,710 | |
L = 戰爭 War | 888 | 306 | 468 | 826 | 4,189 | |
M = 音樂/歌舞 Music/Dance | 415 | 172 | 291 | 583 | 4,094 | |
N = 紀錄片 Documentary | 647 | 17 | 43 | 51 | 878 | |
O = 武俠 Martial Arts | 160 | 60 | 112 | 182 | 436 | |
Total (260,720) | 44,490 | 14,478 | 22,387 | 37,418 | 141,947 | |
Share of total | 17 % | 6 % | 9 % | 14 % | 54 % | |
Combined count | 58,968 (Rank1 + Rank2) | | 22,387 (Rank3) | 179,365 (Rank4 + Rank5) | | |
Combined share | 22.6 % | | 8.6 % | 68.8 % | | |
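The combined counts and shares in the last rows of the table can be recomputed from the per-rank totals. The sketch below assumes that the three combined figures group ranks 1 and 2, rank 3, and ranks 4 and 5, which is what the arithmetic suggests; the variable names are illustrative only.

```python
# Per-rank totals copied from the "Total" row of the table.
rank_totals = {1: 44490, 2: 14478, 3: 22387, 4: 37418, 5: 141947}
grand_total = sum(rank_totals.values())

# Assumed grouping of star ranks (inferred from the combined counts).
groups = {"ranks 1-2": (1, 2), "rank 3": (3,), "ranks 4-5": (4, 5)}

group_counts = {
    label: sum(rank_totals[rank] for rank in ranks)
    for label, ranks in groups.items()
}
group_shares = {
    label: round(100 * count / grand_total, 1)
    for label, count in group_counts.items()
}

for label in groups:
    print(label, group_counts[label], f"{group_shares[label]} %")
```

The high-rank group accounts for roughly two thirds of the collected opinions, so the corpus is strongly skewed toward favourable reviews.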
1.2 Appendix II the pilot experiment
In a pilot experiment, we built SVM classifiers from a training set of 40,000 randomly selected opinions, using the pre-defined wordlists NTUSD and HowNet as features. We kept the original sentiment orientation of each word; that is, we applied the positive wordlists to positive classifiers and the negative wordlists to negative classifiers. The results show that this approach is not adequate for general-purpose classifiers, because all scores are quite low except those for the NTUSD negative wordlist.
Wordlist used to build SVM model | Classifier | # of features | F1 | Precision | Recall |
HowNet positive | positive | 3,651 | 0.272 | 0.307 | 0.244 |
HowNet negative | negative | 3,036 | 0.073 | 0.079 | 0.049 |
NTUSD positive | positive | 1,239 | 0.067 | 0.074 | 0.061 |
NTUSD negative | negative | 4,829 | 0.706 | 0.684 | 0.729 |
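For readers comparing the scores, F1 is the harmonic mean of precision and recall. The short sketch below recomputes F1 from the precision and recall reported for two of the wordlists; the function name `f1_score` is an illustrative helper, not part of the authors' code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the pilot experiment table.
hownet_positive_f1 = round(f1_score(0.307, 0.244), 3)
ntusd_negative_f1 = round(f1_score(0.684, 0.729), 3)

print(hownet_positive_f1)  # 0.272
print(ntusd_negative_f1)   # 0.706
```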
1.3 Appendix III exclusion list and sentence boundary
The following P.O.S tags form the exclusion list, since words with these tags carry no essential meaning for sentiment analysis.
1. ‘Caa’ (tagged as conjunction),
2. ‘D’, ‘DE’, ‘Dfa’ (tagged as adverb),
3. ‘Nh’ (tagged as pronoun),
4. ‘Ndaa’, ‘Ndab’, ‘Ndc’, ‘Ndd’ (tagged as time noun),
5. ‘Nep’, ‘Neqa’, ‘Neqb’, ‘Nes’, ‘Neu’ (tagged as modifier),
6. ‘Nf’, ‘Nfa’, ‘Nfb’, ‘Nfc’, ‘Nfd’, ‘Nfe’, ‘Nfg’, ‘Nfh’, ‘Nfi’ (tagged as quantifier),
7. ‘T’, ‘Ta’, ‘Tb’, ‘Tc’, ‘Td’ (tagged as interjection, auxiliary word),
8. ‘V_2’ (tagged as “有”, i.e. “have” or “has”),
9. ‘SHI’ (tagged as “是”, i.e. “is” or “are”)
To determine the end of a sentence in an opinion, we use the following P.O.S tags as sentence boundaries:
1. “FW”,
2. “QUESTIONCATEGORY”,
3. “COLONCATEGORY”,
4. “COMMACATEGORY”,
5. “DASHCATEGORY”,
6. “ETCCATEGORY”,
7. “PARENTHESISCATEGORY”,
8. “PAUSECATEGORY”,
9. “PERIODCATEGORY”,
10. “QUESTIONCATEGORY”,
11. “SEMICOLONCATEGORY”,
12. “EXCLANATIONCATEGORY”,
13. “BR” (HTML mark for end of sentence),
14. “SPCHANGECATEGORY”
For more information on Sinica CKIP tagging, please refer to http://ckipsvr.iis.sinica.edu.tw/cat.htm
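The two tag lists in this appendix can be applied together in a single pass over tagged text. The sketch below is a minimal illustration, not the authors' implementation: it assumes the input is already a list of (word, tag) pairs from the CKIP tagger, and the function name `split_and_filter` and the sample sentence with its tags are hypothetical.

```python
# Tags on the exclusion list: words with these tags are dropped.
EXCLUDED_TAGS = {
    "Caa", "D", "DE", "Dfa", "Nh",
    "Ndaa", "Ndab", "Ndc", "Ndd",
    "Nep", "Neqa", "Neqb", "Nes", "Neu",
    "Nf", "Nfa", "Nfb", "Nfc", "Nfd", "Nfe", "Nfg", "Nfh", "Nfi",
    "T", "Ta", "Tb", "Tc", "Td",
    "V_2", "SHI",
}

# Tags treated as sentence boundaries (spelled as listed in the appendix).
BOUNDARY_TAGS = {
    "FW", "QUESTIONCATEGORY", "COLONCATEGORY", "COMMACATEGORY",
    "DASHCATEGORY", "ETCCATEGORY", "PARENTHESISCATEGORY", "PAUSECATEGORY",
    "PERIODCATEGORY", "SEMICOLONCATEGORY", "EXCLANATIONCATEGORY",
    "BR", "SPCHANGECATEGORY",
}


def split_and_filter(tagged_tokens):
    """Split (word, tag) pairs into sentences and drop excluded tags."""
    sentences, current = [], []
    for word, tag in tagged_tokens:
        if tag in BOUNDARY_TAGS:
            # Close the current sentence; the boundary token itself is dropped.
            if current:
                sentences.append(current)
                current = []
        elif tag not in EXCLUDED_TAGS:
            current.append(word)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical tagged opinion: "這部電影很好看，我喜歡。"
sample = [
    ("這", "Nep"), ("部", "Nf"), ("電影", "Na"), ("很", "Dfa"), ("好看", "VH"),
    ("，", "COMMACATEGORY"), ("我", "Nh"), ("喜歡", "VK"), ("。", "PERIODCATEGORY"),
]
sentences = split_and_filter(sample)
print(sentences)  # [['電影', '好看'], ['喜歡']]
```

Sentences that contain only excluded words are skipped entirely, which keeps empty feature lists out of any later collocation step.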
Cite this article
Yang, HL., Chao, A.F.Y. Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations. Inf Syst Front 17, 1335–1352 (2015). https://doi.org/10.1007/s10796-014-9498-1
DOI: https://doi.org/10.1007/s10796-014-9498-1