Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations

Yang, Heng-Li; Chao, August F. Y.

doi:10.1007/s10796-014-9498-1

Published: 23 May 2014

Volume 17, pages 1335–1352, (2015)
Information Systems Frontiers


Abstract

The application of sentiment analysis, also known as opinion mining, is more difficult in Chinese than in Indo-European languages, owing to the compounding nature of Chinese words and phrases and the relative lack of reliable resources in Chinese. This study used seed words, namely Chinese morphemes (mono-syllabic characters that function as individual words or can be combined to create Chinese words and phrases), to classify movie reviews found on Yahoo! Taiwan. We utilized collocations with higher Pointwise Mutual Information (PMI), which consist of selected morpheme-level compounded features, to build classifiers. The contributions of this study include the following: (1) proposing a method of generating domain-dependent Chinese morphemes directly from a large data set without any predefined sentiment resources; (2) building morpheme-based classifiers that are applicable to various movie genres and are shown to produce better results than other classifiers based on keywords (NTUSD and HowNet) or feature selection (TFIDF); (3) identifying compounds that have different semantic polarities depending on context.
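As a rough illustration of the PMI score named in the abstract, the following minimal sketch computes PMI(x, y) = log2(P(x, y) / (P(x) P(y))) for adjacent token pairs, in the sense of Church and Hanks (1990). The toy reviews, the adjacency window, and the function name `pmi_scores` are assumptions made for illustration only; they are not the authors' data or implementation.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return {(x, y): PMI} for adjacent token pairs.

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated from raw unigram and adjacent-bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical, already-segmented toy reviews (not taken from the corpus).
toy = [["很", "好", "看"], ["不", "好", "看"], ["很", "感", "人"]]
scores = pmi_scores(toy)

# Rank candidate collocations from strongest to weakest association.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Pairs that co-occur more often than their individual frequencies would predict receive higher scores, which is the sense in which "higher PMI" collocations would be preferred as features.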

Notes

A part-of-speech tagger (P.O.S tagger) is software that reads text and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). The part-of-speech tools from SINICA CKIP are available at http://ckipsvr.iis.sinica.edu.tw/
The collected Taiwan Yahoo!Movies corpus with P.O.S tags from CKIP is available at https://github.com/fychao/ChineseMovieReviews
The simplified/traditional Chinese conversion tables include parallel translations of common words and phrases in Taiwan, China, Hong Kong, and Singapore, and can be retrieved from the following link: http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/ZhConversion.php
Scikit-learn v0.12 http://scikit-learn.org/stable/
Natural Language Toolkit 2.0 https://github.com/nltk
Ten-fold cross-validation is a process that splits the training dataset into 10 equal-sized subsets and then, in turn, uses one subset for testing and the others for training. The validation process therefore involves 10 iterations of training and testing.
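The ten-fold procedure described in the last note can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the helper name `k_fold_splits` and the integer stand-in data are assumptions, and the folds are formed by simple striding without shuffling.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists, using each of the k folds once for testing."""
    # Striding gives k folds whose sizes differ by at most one item.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


data = list(range(100))  # stand-in for 100 labelled opinions
splits = list(k_fold_splits(data, k=10))

for train, test in splits:
    # A classifier would be trained on `train` and evaluated on `test` here.
    pass
```

Each item appears in exactly one test fold, so the ten test sets together cover the whole dataset once.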

References

Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69–72). Sydney, Australia.
Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational linguistics, 16(1), 22–29.
Das, S., & Chen, M. (2001). Yahoo! for Amazon: extracting market sentiment from stock message boards. Management Science, 53(9), 1375–1388.
Dong, Z., & Dong, Q. (2006). HowNet and the Computation of Meaning. World Scientific.
Esuli, A., & Sebastiani, F. (2006). Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC (Vol. 6, pp.417–422). Genoa, Italy.
Feng, S., Wang, L., Xu, W., Wang, D., & Yu, G. (2012). Unsupervised learning Chinese sentiment lexicon from massive microblog data. Advanced Data Mining and Applications, 7713, 27–38.
Ku, L. W., Liang, Y. T., & Chen, H. H. (2006). Opinion extraction, summarization and tracking in news and blog corpora. Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, AAAI Technical Report, 100–107. CA, USA.
Ku, L. W., Liu, I. C., Lee, C. Y., Chen, K. H., & Chen, H. H. (2008). Sentence-level opinion analysis by COPEOPI in NTCIR-7. In Proceeding of NTCIR-7 Workshop (pp. 260–267). Tokyo, Japan.
Ku, L. W., Huang, T. H., & Chen, H. H. (2009). Using morphological and syntactic structures for Chinese opinion analysis. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Vol. 3, no.3, pp. 1260–1269). Singapore.
Li, N., & Wu, D. D. (2010). Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2), 354–368.
Li, L., & Yao, T. (2007, August). Kernel-based sentiment classification for Chinese sentence. In Advanced Language Processing and Web Information Technology, ALPIT 2007. Sixth International Conference (pp. 27–32). Henan, China.
Li, D., Ma, Y. T., & Guo, J. L. (2009). Words semantic orientation classification based on HowNet. The Journal of China Universities of Posts and Telecommunications, 16(1), 106–110.
Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of natural language processing, 2nd edition.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd international conference on Knowledge capture (pp. 70–77). NY, USA.
Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. Annual Meeting-Association for computational linguistics, 43(1). Jeju, Korea.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1–2), 1–135.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Duchesnay, É., et al. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Sun, Y. T., Chen, C. L., Liu, C. C., Liu, C. L., & Soo, V. W. (2010). Sentiment classification of short Chinese sentences. Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010) (pp. 184–198). San Jose de Buan, Philippines.
Tan, S., & Zhang, J. (2008). An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4), 2622–2629.
Turney, P. D. (2001, September). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the 12th European Conference on Machine Learning (pp. 491–502).
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics (pp. 417–424). Freiburg, Germany.
Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London: Butterworth.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Wan, X. J. (2009). Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (Vol. 1, pp. 235–243). Singapore.
Wang, X., Zhao, Y. Q., & Fu, G. H. (2011). A Morpheme-based Method to Chinese Sentence-Level Sentiment Classification. International Journal of Asian Language Processing, 21(3), 95–106. Penang, Malaysia.
Wu, Z., & Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9), 532–542.
Wu, Z., & Tseng, G. (1999). ACTS: an automatic Chinese text segmentation system for full text retrieval. Journal of the American Society for Information Science, 46(2), 83–96.
Wu, Y., & Wen, M. (2010, August). Disambiguating dynamic sentiment ambiguous adjectives. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010) (pp. 1191–1199). Beijing, China.
Xu, H., Zhao, K., Qiu, L., & Hu, C. (2011). Expanding Chinese sentiment dictionaries from large scale unlabeled corpus. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, 3, 53–57. Sendai, Japan.
Ye, Q., Shi, W., & Li, Y. (2006). Sentiment classification for movie reviews in Chinese by improved semantic oriented approach. Proceedings of the 39th Hawaii International Conference on System Sciences, HICSS’06, 3. Hawaii, USA.
Yuen, R. W., Chan, T. Y., Lai, T. B., Kwong, O. Y., & T’sou, B. K. (2004). Morpheme-based derivation of bipolar semantic orientation of Chinese words. In Proceedings of the 20th international conference on Computational Linguistics (pp. 1008–1014). PA, USA.
Zhang, W. H., Hua, X., & Wei, W. (2012). Weakness Finder: find product weakness from Chinese reviews by using aspects based sentiment analysis. Expert Systems with Applications, 39(11), 10283–10291.
Zhou, X., Marslen-Wilson, W., Taft, M., & Shu, H. (1999). Morphology, orthography, and phonology reading Chinese compound words. Language and cognitive processes, 14(5–6), 525–565.

Acknowledgments

The authors would like to thank the National Science Council, Taiwan, for financially supporting this research under Contract No. NSC 101-2410-H-004-015-MY3.

Author information

Authors and Affiliations

Department of Management Information Systems, National Cheng Chi University, 64, Sec.2, Chihnan Road, Wenshan District, Taipei, Taiwan, Republic of China
Heng-Li Yang & August F. Y. Chao

Corresponding author

Correspondence to Heng-Li Yang.

Appendixes

1.1 Appendix I movie genres

A total of 127,424 opinions with 5-star rankings, comprising 4,631,482 words in 18 movie genres, were collected from Yahoo!Movies Taiwan. Note that one opinion may belong to more than one movie genre at the same time, so the per-genre counts sum to more than the number of opinions.

Genres	Rank1	Rank2	Rank3	Rank4	Rank5	Category
A = 奇幻 Fantasy	1,870	700	979	1,433	4,137	Group 1
B = 科幻 Science Fiction	2,151	857	1,403	2,323	9,657	Group 1
C = 犯罪 Crime	1,116	326	535	934	2,769	Group 2
F = 動作 Action	7,741	2,822	4,535	7,550	27,690	Group 2
D = 劇情 Drama	8,659	2,799	4,334	7,577	33,492	Group 3
E = 溫馨/家庭 Romance/Family	315	134	250	481	2,798
H = 愛情 Love Story	3,628	1,078	1,599	2,833	13,455
P = 動畫 Animation	405	193	367	779	4,534	Group 6
Q = 喜劇 Comedy	2,097	855	1,323	2,463	8,250
I = 冒險 Adventure	4,023	1,301	1,893	2,818	8,414
K = 恐怖 Terror	3,261	712	1,019	1,454	2,734	Group 5
R = 懸疑/驚悚 Mystery/Thriller	6,343	1,881	2,825	4,456	11,056	Group 5
G = 勵志 Inspiring	148	62	95	217	1,654	Group 4
J = 歷史/傳記 History/Biography	623	203	316	458	1,710
L = 戰爭 War	888	306	468	826	4,189
M = 音樂/歌舞 Music/Dance	415	172	291	583	4,094
N = 紀錄片 Documentary	647	17	43	51	878
O = 武俠 Martial Arts	160	60	112	182	436
Total: 260,720	44,490	14,478	22,387	37,418	141,947
	17 %	6 %	9 %	14 %	54 %
Counting	58,968		22,387	179,365
	22.6 %		8.6 %	68.8 %
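The last two rows of the table appear to merge Rank1 with Rank2 and Rank4 with Rank5 before computing shares. The short sketch below recomputes those merged counts and percentages from the per-rank totals in the "Total" row; the grouping labels and variable names are assumptions introduced for illustration.

```python
# Per-rank totals copied from the "Total" row of the table.
rank_totals = {1: 44490, 2: 14478, 3: 22387, 4: 37418, 5: 141947}
total = sum(rank_totals.values())

# Assumed grouping: low ranks, middle rank, high ranks.
groups = {"ranks 1-2": (1, 2), "rank 3": (3,), "ranks 4-5": (4, 5)}

counts = {
    name: sum(rank_totals[r] for r in ranks) for name, ranks in groups.items()
}
shares = {name: round(100 * c / total, 1) for name, c in counts.items()}

for name in groups:
    print(f"{name}: {counts[name]:,} ({shares[name]} %)")
```

The recomputed values agree with the "Counting" rows, which indicates that the percentages are taken over the 260,720 genre-level counts rather than over the 127,424 distinct opinions.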

1.2 Appendix II the pilot experiment

In a pilot experiment, we used 40,000 randomly selected opinions as the training set and took the words of the predefined wordlists, NTUSD and HowNet, as features to build SVM classifiers. We kept the original sentiment orientation of those words; that is, we applied the positive wordlists to positive classifiers and the negative wordlists to negative classifiers. The results show that this approach is not adequate for general-purpose classifiers, because all measures are quite low except those of the NTUSD negative wordlist.

Keyword list used to build SVM model	Result
Use: HowNet positive; Model: positive classifier; # of features: 3,651	F1: 0.272, Precision: 0.307, Recall: 0.244
Use: HowNet negative; Model: negative classifier; # of features: 3,036	F1: 0.073, Precision: 0.079, Recall: 0.049
Use: NTUSD positive; Model: positive classifier; # of features: 1,239	F1: 0.067, Precision: 0.074, Recall: 0.061
Use: NTUSD negative; Model: negative classifier; # of features: 4,829	F1: 0.706, Precision: 0.684, Recall: 0.729
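For readers who want to relate the three measures in the table, F1 is conventionally the harmonic mean of precision and recall. The sketch below applies that standard formula to two of the reported rows; the function name `f1_score` is an assumption, and whether the reported figures were computed per fold and then averaged is not stated here, so exact agreement on every row is not guaranteed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from two rows of the table.
hownet_positive = f1_score(0.307, 0.244)
ntusd_negative = f1_score(0.684, 0.729)

print(round(hownet_positive, 3), round(ntusd_negative, 3))
```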

1.3 Appendix III exclusion list and sentence boundary

Tokens with the following P.O.S tags are placed on the exclusion list, since they carry no essential meaning for sentiment analysis.

1. ‘Caa’ (tagged as conjunction),
2. ‘D’, ‘DE’, ‘Dfa’ (tagged as adverb),
3. ‘Nh’ (tagged as pronoun),
4. ‘Ndaa’, ‘Ndab’, ‘Ndc’, ‘Ndd’ (tagged as time noun),
5. ‘Nep’, ‘Neqa’, ‘Neqb’, ‘Nes’, ‘Neu’ (tagged as modifier),
6. ‘Nf’, ‘Nfa’, ‘Nfb’, ‘Nfc’, ‘Nfd’, ‘Nfe’, ‘Nfg’, ‘Nfh’, ‘Nfi’ (tagged as quantifier),
7. ‘T’, ‘Ta’, ‘Tb’, ‘Tc’, ‘Td’ (tagged as interjection, auxiliary word),
8. ‘V_2’ (tagged as “有”, i.e. “have” or “has”),
9. ‘SHI’ (tagged as “是”, i.e. “is” or “are”)
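The exclusion list above could be applied to CKIP-tagged text roughly as follows. This is a minimal sketch that assumes tokens arrive as `(word, tag)` pairs and that tags are matched exactly; the sample sentence and its hand-assigned tags are hypothetical and are not taken from the corpus.

```python
# P.O.S tags from the exclusion list, matched exactly (an assumption).
EXCLUDED_TAGS = {
    "Caa",
    "D", "DE", "Dfa",
    "Nh",
    "Ndaa", "Ndab", "Ndc", "Ndd",
    "Nep", "Neqa", "Neqb", "Nes", "Neu",
    "Nf", "Nfa", "Nfb", "Nfc", "Nfd", "Nfe", "Nfg", "Nfh", "Nfi",
    "T", "Ta", "Tb", "Tc", "Td",
    "V_2",
    "SHI",
}


def drop_excluded(tagged_tokens):
    """Keep only the (word, tag) pairs whose tag is not on the exclusion list."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in EXCLUDED_TAGS]


# Hypothetical tagged opinion with hand-assigned tags.
sample = [
    ("我", "Nh"), ("很", "Dfa"), ("喜歡", "VK"),
    ("這", "Nep"), ("部", "Nf"), ("電影", "Na"),
]
kept = drop_excluded(sample)
```

In this toy case only the verb and the noun survive, which matches the stated aim of discarding tokens without sentiment-bearing content.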

To determine the end of a sentence in an opinion, we use the following P.O.S tags as sentence boundaries:

1. “FW”,
2. “QUESTIONCATEGORY”,
3. “COLONCATEGORY”,
4. “COMMACATEGORY”,
5. “DASHCATEGORY”,
6. “ETCCATEGORY”,
7. “PARENTHESISCATEGORY”,
8. “PAUSECATEGORY”,
9. “PERIODCATEGORY”,
10. “QUESTIONCATEGORY”,
11. “SEMICOLONCATEGORY”,
12. “EXCLANATIONCATEGORY”,
13. “BR” (HTML mark for end of sentence),
14. “SPCHANGECATEGORY”
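Sentence splitting with the boundary tags listed above might look like the following sketch. It assumes `(word, tag)` input pairs and drops the boundary tokens themselves, which is an illustrative choice rather than something the text specifies; the tag spellings are copied from the list as given, and the sample tokens are hypothetical.

```python
# Boundary tags copied from the list above (the duplicate entry collapses in a set).
BOUNDARY_TAGS = {
    "FW", "QUESTIONCATEGORY", "COLONCATEGORY", "COMMACATEGORY",
    "DASHCATEGORY", "ETCCATEGORY", "PARENTHESISCATEGORY", "PAUSECATEGORY",
    "PERIODCATEGORY", "SEMICOLONCATEGORY", "EXCLANATIONCATEGORY",
    "BR", "SPCHANGECATEGORY",
}


def split_sentences(tagged_tokens):
    """Split a tagged opinion into sentences at boundary tags."""
    sentences, current = [], []
    for word, tag in tagged_tokens:
        if tag in BOUNDARY_TAGS:
            # Close the running sentence; skip empty ones between adjacent marks.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical tagged opinion with hand-assigned tags.
sample = [
    ("劇情", "Na"), ("精彩", "VH"), ("，", "COMMACATEGORY"),
    ("演員", "Na"), ("出色", "VH"), ("。", "PERIODCATEGORY"),
]
sentences = split_sentences(sample)
```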

For more information on Sinica CKIP tagging, please refer to http://ckipsvr.iis.sinica.edu.tw/cat.htm

About this article

Cite this article

Yang, HL., Chao, A.F.Y. Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations. Inf Syst Front 17, 1335–1352 (2015). https://doi.org/10.1007/s10796-014-9498-1

Issue Date: December 2015
