Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations

Information Systems Frontiers

Abstract

Sentiment analysis, also known as opinion mining, is more difficult to apply in Chinese than in Indo-European languages because of the compounding nature of Chinese words and phrases and the relative lack of reliable resources in Chinese. This study used seed words, namely Chinese morphemes, which are mono-syllabic characters that function as individual words or can be combined to create Chinese words and phrases, to classify movie reviews found on Yahoo! Taiwan. We used collocations with higher Pointwise Mutual Information (PMI), which consist of selected morpheme-level compounded features, to build classifiers. The contributions of this study include the following: (1) proposing a method of generating domain-dependent Chinese morphemes directly from a large data set without any predefined sentiment resources; (2) building morpheme-based classifiers that are applicable across various movie genres and are shown to produce better results than other classifiers based on keywords (NTUSD and HowNet) or feature selection (TFIDF); (3) identifying compounds that have different semantic polarities depending on context.
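To illustrate the PMI scoring mentioned in the abstract: for two units x and y, PMI is log2(P(x, y) / (P(x) P(y))). The following is a minimal sketch, assuming co-occurrence is counted over adjacent units within a sentence. The function name `pmi_scores`, the adjacency window, and the toy corpus are illustrative assumptions and are not taken from the article.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return {(x, y): PMI} for adjacent pairs in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Adjacent pairs only; a wider window would be a different design choice.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hand-made toy corpus (hypothetical, not from the Yahoo! Movies data).
toy = [["好", "看"], ["好", "看"], ["不", "好", "看"], ["不", "錯"]]
scores = pmi_scores(toy)
```

Pairs with a higher score co-occur more often than their individual frequencies would predict, which is the property a collocation filter relies on.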

Notes

  1. A Part-Of-Speech Tagger (P.O.S Tagger) is software that reads text and assigns each token a part of speech, such as noun, verb, or adjective. The part-of-speech tools from SINICA CKIP are available at http://ckipsvr.iis.sinica.edu.tw/

  2. The collected Yahoo! Movies Taiwan corpus, with P.O.S tags from CKIP, is available at https://github.com/fychao/ChineseMovieReviews

  3. The simplified/traditional Chinese conversion tables include parallel translations of common words and phrases used in Taiwan, China, Hong Kong, and Singapore, and can be retrieved from the following link: http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/ZhConversion.php

  4. Scikit-learn v0.12 http://scikit-learn.org/stable/

  5. Natural Language Toolkit 2.0 https://github.com/nltk

  6. Ten-fold cross-validation splits the training dataset into 10 subsets of equal size, then uses each subset in turn for testing and the remaining nine for training. The validation process therefore involves 10 iterations of training and testing.
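The ten-fold procedure in note 6 can be sketched as follows. This is a minimal, library-free illustration; the function name `k_fold_splits` is hypothetical, and shuffling the data beforehand (which is common practice) is omitted for brevity.

```python
def k_fold_splits(n_samples, n_folds=10):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 100 samples give 10 iterations, each with 90 training and 10 test items.
splits = list(k_fold_splits(100))
```

Every sample is used for testing exactly once, so the scores averaged over the 10 iterations cover the whole dataset.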

References

  • Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69–72). Sydney, Australia.

  • Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational linguistics, 16(1), 22–29.

  • Das, S., & Chen, M. (2001). Yahoo! for Amazon: extracting market sentiment from stock message boards. Management Science, 53(9), 1375–1388.

  • Dong, Z., & Dong, Q. (2006). HowNet and the Computation of Meaning. World Scientific.

  • Esuli, A., & Sebastiani, F. (2006). Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC (Vol. 6, pp.417–422). Genoa, Italy.

  • Feng, S., Wang, L., Xu, W., Wang, D., & Yu, G. (2012). Unsupervised learning Chinese sentiment lexicon from massive microblog data. Advanced Data Mining and Applications, 7713, 27–38.

  • Ku, L. W., Liang, Y. T. & Chen, H. H. (2006). Opinion extraction, summarization and tracking in news and blog Corpora. Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, AAAI Technical Report, 100–107. CA, USA.

  • Ku, L. W., Liu, I. C., Lee, C. Y., Chen, K. H., & Chen, H. H. (2008). Sentence-level opinion analysis by COPEOPI in NTCIR-7. In Proceeding of NTCIR-7 Workshop (pp. 260–267). Tokyo, Japan.

  • Ku, L. W., Huang, T. H., & Chen, H. H. (2009). Using morphological and syntactic structures for Chinese opinion analysis. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Vol. 3, no.3, pp. 1260–1269). Singapore.

  • Li, N., & Wu, D. D. (2010). Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2), 354–368.

  • Li, L., & Yao, T. (2007, August). Kernel-based sentiment classification for Chinese sentence. In Advanced Language Processing and Web Information Technology, ALPIT 2007. Sixth International Conference (pp. 27–32). Henan, China.

  • Li, D., Ma, Y. T., & Guo, J. L. (2009). Words semantic orientation classification based on HowNet. The Journal of China Universities of Posts and Telecommunications, 16(1), 106–110.

  • Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of natural language processing, 2nd edition.

  • Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.

  • Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd international conference on Knowledge capture (pp. 70–77). NY, USA.

  • Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. Annual Meeting-Association for computational linguistics, 43(1). Jeju, Korea.

  • Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1–2), 1–135.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Duchesnay, É., et al. (2011). Scikit-learn: machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.

  • Sun, Y. T., Chen, C. L., Liu, C. C., Liu, C. L., & Soo, V. W. (2010). Sentiment classification of short Chinese sentences. Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010) (pp. 184–198). San Jose de Buan, Philippines.

  • Tan, S., & Zhang, J. (2008). An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4), 2622–2629.

  • Turney, P. D. (2001, September). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the 12th European Conference on Machine Learning (pp. 491–502).

  • Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics (pp. 417–424). Freiburg, Germany.

  • Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London: Butterworth.

  • Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.

  • Wan, X. J. (2009). Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (Vol. 1, pp. 235–243). Singapore.

  • Wang, X., Zhao, Y. Q., & Fu, G. H. (2011). A Morpheme-based Method to Chinese Sentence-Level Sentiment Classification. International Journal of Asian Language Processing, 21(3), 95–106. Penang, Malaysia.

  • Wu, Z., & Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9), 532–542.

  • Wu, Z., & Tseng, G. (1999). ACTS: an automatic Chinese text segmentation system for full text retrieval. Journal of the American Society for Information Science, 46(2), 83–96.

  • Wu, Y., & Wen, M. (2010, August). Disambiguating dynamic sentiment ambiguous adjectives. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010) (pp. 1191–1199). Beijing, China.

  • Xu, H., Zhao, K., Qiu, L., & Hu, C. (2011). Expanding Chinese sentiment dictionaries from large scale unlabeled corpus. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, 3, 53–57. Sendai, Japan.

  • Ye, Q., Shi, W., & Li, Y. (2006). Sentiment classification for movie reviews in Chinese by improved semantic oriented approach. Proceedings of the 39th Hawaii International Conference on System Sciences, HICSS’06, 3. Hawaii, USA.

  • Yuen, R. W., Chan, T. Y., Lai, T. B., Kwong, O. Y., & T’sou, B. K. (2004). Morpheme-based derivation of bipolar semantic orientation of Chinese words. In Proceedings of the 20th international conference on Computational Linguistics (pp. 1008–1014). PA, USA.

  • Zhang, W. H., Hua, X., & Wei, W. (2012). Weakness Finder: find product weakness from Chinese reviews by using aspects based sentiment analysis. Expert Systems with Applications, 39(11), 10283–10291.

  • Zhou, X., Marslen-Wilson, W., Taft, M., & Shu, H. (1999). Morphology, orthography, and phonology reading Chinese compound words. Language and cognitive processes, 14(5–6), 525–565.

Acknowledgments

The authors would like to thank the National Science Council, Taiwan, for financially supporting this research under Contract No. NSC 101-2410-H-004-015-MY3.

Author information

Corresponding author

Correspondence to Heng-Li Yang.

Appendixes

1.1 Appendix I movie genres

A total of 127,424 opinions, each rated on a 5-star scale, were collected from Yahoo! Movies Taiwan. They contain 4,631,482 words and cover 18 movie genres. Note that one opinion may belong to more than one movie genre at the same time, so the per-genre counts below sum to more than the number of opinions.

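| Genres | Rank1 | Rank2 | Rank3 | Rank4 | Rank5 | Category |
|---|---|---|---|---|---|---|
| A = 奇幻 Fantasy | 1,870 | 700 | 979 | 1,433 | 4,137 | Group 1 |
| B = 科幻 Science Fiction | 2,151 | 857 | 1,403 | 2,323 | 9,657 | Group 1 |
| C = 犯罪 Crime | 1,116 | 326 | 535 | 934 | 2,769 | Group 2 |
| F = 動作 Action | 7,741 | 2,822 | 4,535 | 7,550 | 27,690 | Group 2 |
| D = 劇情 Drama | 8,659 | 2,799 | 4,334 | 7,577 | 33,492 | Group 3 |
| E = 溫馨/家庭 Romance/Family | 315 | 134 | 250 | 481 | 2,798 | Group 3 |
| H = 愛情 Love Story | 3,628 | 1,078 | 1,599 | 2,833 | 13,455 | Group 3 |
| P = 動畫 Animation | 405 | 193 | 367 | 779 | 4,534 | Group 6 |
| Q = 喜劇 Comedy | 2,097 | 855 | 1,323 | 2,463 | 8,250 | Group 6 |
| I = 冒險 Adventure | 4,023 | 1,301 | 1,893 | 2,818 | 8,414 | Group 6 |
| K = 恐怖 Terror | 3,261 | 712 | 1,019 | 1,454 | 2,734 | Group 5 |
| R = 懸疑/驚悚 Mystery/Thriller | 6,343 | 1,881 | 2,825 | 4,456 | 11,056 | Group 5 |
| G = 勵志 Inspiring | 148 | 62 | 95 | 217 | 1,654 | Group 4 |
| J = 歷史/傳記 History/Biography | 623 | 203 | 316 | 458 | 1,710 | Group 4 |
| L = 戰爭 War | 888 | 306 | 468 | 826 | 4,189 | Group 4 |
| M = 音樂/歌舞 Music/Dance | 415 | 172 | 291 | 583 | 4,094 | Group 4 |
| N = 紀錄片 Documentary | 647 | 17 | 43 | 51 | 878 | Group 4 |
| O = 武俠 Martial Arts | 160 | 60 | 112 | 182 | 436 | Group 4 |
| Total: 260,720 | 44,490 | 14,478 | 22,387 | 37,418 | 141,947 | |
| Share of total | 17 % | 6 % | 9 % | 14 % | 54 % | |

| Counting | Rank1 + Rank2 | Rank3 | Rank4 + Rank5 |
|---|---|---|---|
| Count | 58,968 | 22,387 | 179,365 |
| Share of total | 22.6 % | 8.6 % | 68.8 % |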
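The Counting figures can be reproduced from the Total row. The sketch below assumes, as the sums suggest, that Rank1 and Rank2 are merged into one class, Rank3 stands alone, and Rank4 and Rank5 are merged. The grouping labels and variable names are illustrative.

```python
# Per-rank totals copied from the Total row of the genre table.
totals = {1: 44490, 2: 14478, 3: 22387, 4: 37418, 5: 141947}

# Assumed grouping of star ranks into three classes (inferred from the sums).
groups = {
    "Rank1+Rank2": (1, 2),
    "Rank3": (3,),
    "Rank4+Rank5": (4, 5),
}

grand_total = sum(totals.values())
counts = {name: sum(totals[r] for r in ranks) for name, ranks in groups.items()}
shares = {name: round(100 * c / grand_total, 1) for name, c in counts.items()}

for name in groups:
    print(f"{name}: {counts[name]:,} ({shares[name]} %)")
```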

 

1.2 Appendix II the pilot experiment

In a pilot experiment, we used 40,000 randomly selected opinions as the training set and took the words of the predefined wordlists, NTUSD and HowNet, as features to build SVM classifiers. We kept the original sentiment orientation of each word. That is, we applied positive wordlists to positive classifiers and negative wordlists to negative classifiers. The results show that this approach is not adequate for general-purpose classifiers, because F1, precision, and recall are all quite low except for the NTUSD negative wordlist.

Use keyword list to build SVM model, and result:

| Wordlist used | Model | # of features | F1 | Precision | Recall |
|---|---|---|---|---|---|
| HowNet positive | Positive classifier | 3,651 | 0.272 | 0.307 | 0.244 |
| HowNet negative | Negative classifier | 3,036 | 0.073 | 0.079 | 0.049 |
| NTUSD positive | Positive classifier | 1,239 | 0.067 | 0.074 | 0.061 |
| NTUSD negative | Negative classifier | 4,829 | 0.706 | 0.684 | 0.729 |
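For readers comparing the three measures, F1 is the harmonic mean of precision and recall. The following minimal sketch recomputes it from the reported precision and recall of two rows. The function name `f1_score` is illustrative, and values recomputed from rounded inputs may differ slightly from a reported F1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the pilot experiment.
reported = {
    "HowNet positive": (0.307, 0.244),
    "NTUSD negative": (0.684, 0.729),
}
computed = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
```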

1.3 Appendix III exclusion list and sentence boundary

Words with the following P.O.S tags are excluded, since they carry no essential meaning for sentiment analysis.

  1. ‘Caa’ (tagged as conjunction),

  2. ‘D’, ‘DE’, ‘Dfa’ (tagged as adverb),

  3. ‘Nh’ (tagged as pronoun),

  4. ‘Ndaa’, ‘Ndab’, ‘Ndc’, ‘Ndd’ (tagged as time noun),

  5. ‘Nep’, ‘Neqa’, ‘Neqb’, ‘Nes’, ‘Neu’ (tagged as modifier),

  6. ‘Nf’, ‘Nfa’, ‘Nfb’, ‘Nfc’, ‘Nfd’, ‘Nfe’, ‘Nfg’, ‘Nfh’, ‘Nfi’ (tagged as quantifier),

  7. ‘T’, ‘Ta’, ‘Tb’, ‘Tc’, ‘Td’ (tagged as interjection, auxiliary word),

  8. ‘V_2’ (tagged as “有”, i.e. “have” or “has”),

  9. ‘SHI’ (tagged as “是”, i.e. “is” or “are”)

To determine the end of a sentence in an opinion, we use the following P.O.S tags as sentence boundaries:

  1. “FW”,

  2. “QUESTIONCATEGORY”,

  3. “COLONCATEGORY”,

  4. “COMMACATEGORY”,

  5. “DASHCATEGORY”,

  6. “ETCCATEGORY”,

  7. “PARENTHESISCATEGORY”,

  8. “PAUSECATEGORY”,

  9. “PERIODCATEGORY”,

  10. “QUESTIONCATEGORY”,

  11. “SEMICOLONCATEGORY”,

  12. “EXCLANATIONCATEGORY”,

  13. “BR” (HTML mark for end of sentence),

  14. “SPCHANGECATEGORY”

For more information on Sinica CKIP tagging, please refer to http://ckipsvr.iis.sinica.edu.tw/cat.htm
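The exclusion list and the sentence-boundary tags above can be applied together in one pass. The sketch below is an illustration under stated assumptions: the input is a list of (word, tag) pairs, boundary tokens themselves are dropped, and the function name `split_and_filter` and the sample tagging are hypothetical rather than actual CKIP output.

```python
# Tags copied from the exclusion list in this appendix.
EXCLUDED_TAGS = {
    "Caa", "D", "DE", "Dfa", "Nh",
    "Ndaa", "Ndab", "Ndc", "Ndd",
    "Nep", "Neqa", "Neqb", "Nes", "Neu",
    "Nf", "Nfa", "Nfb", "Nfc", "Nfd", "Nfe", "Nfg", "Nfh", "Nfi",
    "T", "Ta", "Tb", "Tc", "Td", "V_2", "SHI",
}

# Tags copied from the sentence-boundary list in this appendix.
BOUNDARY_TAGS = {
    "FW", "QUESTIONCATEGORY", "COLONCATEGORY", "COMMACATEGORY",
    "DASHCATEGORY", "ETCCATEGORY", "PARENTHESISCATEGORY", "PAUSECATEGORY",
    "PERIODCATEGORY", "SEMICOLONCATEGORY", "EXCLANATIONCATEGORY",
    "BR", "SPCHANGECATEGORY",
}


def split_and_filter(tagged_tokens):
    """Split (word, tag) pairs into sentences and drop excluded tags."""
    sentences, current = [], []
    for word, tag in tagged_tokens:
        if tag in BOUNDARY_TAGS:
            # Close the current sentence; the boundary token is not kept.
            if current:
                sentences.append(current)
                current = []
        elif tag not in EXCLUDED_TAGS:
            current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


# Hand-tagged sample for illustration only.
tagged = [
    ("這", "Nep"), ("部", "Nf"), ("電影", "Na"), ("很", "Dfa"),
    ("好看", "VH"), ("，", "COMMACATEGORY"),
    ("我", "Nh"), ("喜歡", "VK"), ("。", "PERIODCATEGORY"),
]
result = split_and_filter(tagged)
```

Using sets for both tag lists keeps each membership check constant-time, which matters when the corpus runs to millions of tokens.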

Cite this article

Yang, HL., Chao, A.F.Y. Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations. Inf Syst Front 17, 1335–1352 (2015). https://doi.org/10.1007/s10796-014-9498-1

  • DOI: https://doi.org/10.1007/s10796-014-9498-1
