Sentiment classification of Internet restaurant reviews written in Cantonese

https://doi.org/10.1016/j.eswa.2010.12.147

Abstract

Cantonese is an important dialect in some regions of Southern China. Local online users often express their opinions and experiences on the web in written Cantonese. Although the information in those reviews is valuable to potential consumers and sellers, the huge number of web reviews makes it difficult to form an unbiased evaluation of a product, and Cantonese reviews are unintelligible to Mandarin Chinese speakers.

In this paper, two standard machine learning techniques, naive Bayes and SVM, are applied to the domain of online Cantonese-written restaurant reviews to automatically classify user reviews as positive or negative. The effects of feature presentations and feature sizes on classification performance are discussed. We find that accuracy is influenced by the interaction between the classification models and the feature options. The naive Bayes classifier achieves accuracy as good as or better than that of SVM. Character-based bigrams prove to be better features than unigrams and trigrams in capturing Cantonese sentiment orientation.

Research highlights

► Naive Bayes and SVM are used for Cantonese sentiment classification.
► Accuracy is influenced by interaction between classification models and features.
► The naive Bayes classifier achieves accuracy as good as or better than that of SVM.
► Character-based bigrams are better features than unigrams and trigrams in capturing Cantonese sentiment.

Introduction

The Internet continues to become an essential part of everyday life. People are now able to access opinions not only from family members and friends, but also from strangers located around the world who may have used a particular product, visited a certain destination, or seen a movie. The Internet provides a virtual environment for consumers to share their experiences with world-wide travelers via the electronic word-of-mouth (WOM) communication channel (Cheung, Shek, & Sia, 2004). The importance of WOM has been widely documented in the existing literature (Cheung et al., 2004; Goldenberg et al., 2001). WOM not only strongly influences consumers’ decision-making process (Goldenberg et al., 2001), but also has important implications for managers in brand building, product development, and quality assurance (Dellarocas, 2003).

As today’s consumers increasingly make their opinions and experiences available online (Horrigan, 2008), a huge number of consumer reviews of products and services has accumulated on the Web. When trying to locate user opinions of a product, a general online search will turn up millions of web pages. Getting an overall sense of those reviews can be daunting and time-consuming; however, if only a few reviews are read, the evaluation will be biased. Sentiment classification aims to address this problem by automatically classifying user reviews into positive or negative opinions.

Review sentiment classification has become one of the foci of recent research. Many sentiment classification techniques have been developed for English, Japanese, and Mandarin Chinese. However, interest in sentiment analysis is worldwide, since it supports a variety of NLP applications. Research on automatic sentiment analysis should therefore be extended to additional languages such as Cantonese.

Cantonese is an important dialect spoken in and around the cities of southern China, which are among the rapidly developing areas of the country. In those areas, Cantonese is widely used in social settings, and many native Cantonese consumers are not fully literate in Mandarin Chinese. Take Hong Kong as an example. According to the Hong Kong Census and Statistics Department's statistics for the 2006 population, Cantonese was the language most commonly used at home for about 91% of the population. Only about 40% of the population claimed to be able to speak Mandarin Chinese, and the percentage capable of writing it would be lower. Those Cantonese-speaking consumers are very likely to express themselves in written Cantonese in informal settings such as Internet forums; however, because of the differences between Cantonese and Mandarin Chinese, Mandarin speakers cannot read online Cantonese content (or find it so difficult that the effort is rapidly abandoned). Given the importance of written Cantonese (Snow, 2004), innovative techniques that can automatically detect consumer opinions in Cantonese reviews are urgently required.

In this paper, standard machine learning techniques are applied to the domain of online Cantonese-written restaurant reviews to automatically classify user reviews as thumbs-up or thumbs-down. Two popular text classification algorithms (naive Bayes and SVM) and six feature presentations based on n-gram presence or frequency are chosen to examine the effects of the classifiers and the feature options on Cantonese sentiment classification. This study seeks empirical answers to the following research questions:

  1. Does the SVM classifier outperform naive Bayes in Cantonese sentiment classification?

  2. Are higher-order n-grams better features than unigrams for capturing sentiment in Cantonese text?

  3. Is feature presence a better text presentation than feature frequency for feature selection and text classification?

  4. How does the size of the feature set affect the performance of the classifiers?
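
The feature options named above can be made concrete with a short sketch. It assumes that the six feature presentations are character unigrams, bigrams, and trigrams, each encoded either by presence or by frequency; the function names `char_ngrams` and `featurize` and the sample string are illustrative and do not come from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def featurize(text, n, presence=False):
    """Map a review to an n-gram feature dictionary.

    presence=False: value is how many times the n-gram occurs (frequency).
    presence=True:  value is 1 for every n-gram that occurs at least once.
    """
    counts = Counter(char_ngrams(text, n))
    if presence:
        return {gram: 1 for gram in counts}
    return dict(counts)


if __name__ == "__main__":
    review = "好好食"  # hypothetical three-character review
    for n in (1, 2, 3):
        print(n, featurize(review, n), featurize(review, n, presence=True))
```

Working on characters rather than words avoids the need for a Cantonese word segmenter, which is one plausible reason to prefer character n-grams for this kind of text.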

Literature review

Sentiment classification aims to automatically classify the text of written customer reviews into positive or negative opinions. It has emerged as an active research area. Although it is still at a preliminary stage, there has been much work on various languages, such as English (Liu et al., 2005; Pang et al., 2002), Japanese (Fujii & Ishikawa, 2006), and Mandarin Chinese (Ku, Liang, & Chen, 2006).

In this paper, we focus our interest on written Cantonese which can be viewed as a written

Data collection

Because no benchmark data were available, we created a corpus of Cantonese-written reviews by retrieving consumer reviews from the Cantonese site OpenRice (URL: http://www.openrice.com). The site allows diners to enter text feedback and a three-point satisfaction rating for a restaurant located in Hong Kong. As the majority of OpenRice users are inhabitants of Hong Kong, the feedback is generally written in Cantonese, with a few exceptions in English and Mandarin. A crawler was developed in Java to

Performance measures

The category assignments of a polarity classifier can be evaluated using a two-way contingency table (Table 3) which has four cells, where

  • cell a counts the documents correctly assigned to positive reviews;

  • cell b counts the documents incorrectly assigned to positive reviews;

  • cell c counts the documents incorrectly assigned to negative reviews;

  • cell d counts the documents correctly assigned to negative reviews.

The performance measures recall, precision and accuracy are defined and computed from the
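
As a minimal sketch of how such measures are commonly computed from the four cells, the function below uses the standard definitions for the positive class: precision = a / (a + b), recall = a / (a + c), and accuracy = (a + d) / (a + b + c + d). The paper's own formulas are not shown in this excerpt, so treating them as identical to these standard forms is an assumption, and the function name `polarity_metrics` is illustrative.

```python
def polarity_metrics(a, b, c, d):
    """Compute positive-class precision and recall, plus overall accuracy.

    a: positive reviews correctly assigned to positive
    b: reviews incorrectly assigned to positive
    c: reviews incorrectly assigned to negative
    d: negative reviews correctly assigned to negative
    """
    total = a + b + c + d
    # Guard against empty denominators so the function never divides by zero.
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    accuracy = (a + d) / total if total else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(polarity_metrics(a=40, b=10, c=20, d=30))
```

The corresponding measures for the negative class follow by swapping the roles of a and d, and of b and c.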

Results and discussion

Three-fold cross-validation was performed for the experiments reported in this study. The experiments used our own implementation of a naive Bayes classifier and Chang and Lin’s (2001) LIBSVM implementation of a Support Vector Machine classifier with all parameters set to their default values. We ran each classifier with feature sets of various sizes to examine the effects of feature size on sentiment classification performance. Fig. 2, Fig. 3, Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig.
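
The evaluation protocol can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes a multinomial naive Bayes model with add-one smoothing over character bigram frequencies, and it assigns documents to folds by index modulo 3; both choices, along with all function names, are assumptions made for illustration. The SVM side (LIBSVM) is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Return the overlapping character bigrams of a review."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train_nb(docs, labels):
    """Collect class counts, per-class bigram counts, and the vocabulary."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        grams = bigrams(doc)
        feat_counts[label].update(grams)
        vocab.update(grams)
    return class_counts, feat_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, feat_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(feat_counts[label].values()) + len(vocab)
        for gram in bigrams(doc):
            if gram in vocab:  # bigrams unseen in training are ignored
                score += math.log((feat_counts[label][gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validated_accuracy(docs, labels, k=3):
    """Average accuracy over k folds; document i goes to fold i % k."""
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_nb(model, docs[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


if __name__ == "__main__":
    # Hypothetical toy reviews, not data from the paper's corpus.
    docs = ["好食好味", "難食差勁", "好味好食", "差勁難食", "好食", "難食"]
    labels = ["pos", "neg", "pos", "neg", "pos", "neg"]
    print(cross_validated_accuracy(docs, labels))
```

Varying the feature-set size, as described above, would amount to restricting `vocab` to a selected subset of bigrams before scoring; the selection criterion is not specified in this excerpt.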

Conclusion

This paper has shown that machine learning techniques perform quite well in the domain of Cantonese review classification. Despite its unrealistic independence assumption, the naive Bayes classifier surprisingly achieves performance comparable to, or better than, that of SVM. Interactions between classification methods and feature presentation options are observed, and bigram frequency proves to be an effective feature for capturing sentiment in Cantonese text. In addition, we look at the effects of

Acknowledgments

This study was partially funded by the National Science Foundation of China (70971033, 70890082) and NCET-08-0172.

References (29)

  • Q. Ye et al. Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications (2009)
  • Chang, C.-C., Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Software available at...
  • K. Cheung et al. The representation of Cantonese with Chinese characters (2002)
  • Cheung, C. M. Y., Shek, S. P. W., Sia, C. L. (2004). Virtual community of consumers: Why people are willing to...
  • S.R. Das et al. Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science (2007)
  • Dave, K., Lawrence, S., Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic...
  • C. Dellarocas. The digitization of word of mouth: Promise and challenges of online feedback mechanisms. Management Science (2003)
  • P. Domingos et al. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning (1997)
  • Fujii, A., Ishikawa, T. (2006). A system for summarizing and visualizing arguments in subjective documents: Toward...
  • J. Goldenberg et al. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters (2001)
  • U. Gretzel et al. Use and impact of online travel reviews (2008)
  • Hatzivassiloglou, V., McKeown, K. (1997). Predicting the semantic orientation of adjectives. In: Proceedings of the...
  • M. Hearst. Direction-based text interpretation as an information access refinement (1992)
  • Horrigan, J. A. (2008). Online shopping, pew Internet & American life project...