A novel feature extraction methodology for sentiment analysis of product reviews

  • Original Article
Neural Computing and Applications

Abstract

Feature extraction is one of the key steps in text sentiment analysis (SA), and the choice of extraction algorithm has an important effect on the results. In this paper, a novel methodology is proposed to extract features for SA of product reviews. First, because product reviews express the same opinion in diverse forms, generalized TF–IDF feature vectors are obtained by introducing the semantic similarity of synonyms. Then, because product reviews differ in length, the local patterns of the feature vectors are identified with the OPSM (order-preserving submatrix) biclustering algorithm. Finally, we improve the PrefixSpan algorithm to detect frequent and pseudo-consecutive phrases with high discriminative ability (FPCD phrases), which contain word-order information. Furthermore, some important factors, such as the separation and discriminative ability of words, are also employed to improve the discrimination of sentiment polarity. The text feature vectors are extracted on the basis of these steps. A series of experiments and comparisons indicates that the performance of SA on product reviews is greatly improved.
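The first step, generalized TF–IDF, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their similarity measure or weighting formula, so the hand-set table `SYNONYM_SIM`, the helper names, and the smoothed IDF variant below are all assumptions chosen only to show how synonym similarity can spread term weight across related words.

```python
import math

# Hypothetical synonym similarities (illustrative values, not from the paper).
SYNONYM_SIM = {
    ("good", "great"): 0.9,
    ("bad", "poor"): 0.8,
}

def similarity(a, b):
    """Symmetric lookup: identical words score 1, unknown pairs score 0."""
    if a == b:
        return 1.0
    return SYNONYM_SIM.get((a, b), SYNONYM_SIM.get((b, a), 0.0))

def generalized_tf(doc, vocab):
    """Soft term frequency: every token adds its similarity to the term.
    Assumes doc is a non-empty list of tokens."""
    return {t: sum(similarity(t, w) for w in doc) / len(doc) for t in vocab}

def generalized_tfidf(docs):
    """Return the sorted vocabulary and one weight vector per document."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    tfs = [generalized_tf(d, vocab) for d in docs]
    idf = {}
    for t in vocab:
        # A document counts toward df if any of its tokens is similar to t.
        df = sum(1 for tf in tfs if tf[t] > 0)
        idf[t] = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF (assumed variant)
    return vocab, [[tf[t] * idf[t] for t in vocab] for tf in tfs]

docs = [["good", "phone"], ["great", "phone"], ["bad", "battery"]]
vocab, vectors = generalized_tfidf(docs)
```

In this toy corpus the first review never contains "great", yet its vector has a nonzero weight for that term because "good" is listed as a synonym. That is the effect the abstract attributes to introducing synonym similarity: reviews that phrase the same opinion differently end up with closer feature vectors.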

Acknowledgements

The authors gratefully thank the colleagues who participated in this work and provided technical support. This work was supported by the Guangdong Provincial Engineering Technology Research Center for Data Science (Nos. 2016KF09, 2016KF10) and the National Statistical Science Research Project of China (Nos. 2015LY81, 2016LY98). It was also supported by the Science and Technology Department of Guangdong Province in China (Grant Nos. 2016A010101020, 2016A010101021, 2016A010101022), the Guangdong Province Science and Technology Planning Project (No. 2013B040404009), the Foundation of Guangdong Polytechnic of Science and Technology (No. XJSC2016206), the Natural Science Funds of Shenzhen Science and Technology Innovation Commission (No. JCYJ20160527172144272), and the Innovation Project of Graduate School of South China Normal University (No. 2015lkxm37).

Author information

Corresponding author

Correspondence to Yun Xue.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Chen, X., Xue, Y., Zhao, H. et al. A novel feature extraction methodology for sentiment analysis of product reviews. Neural Comput & Applic 31, 6625–6642 (2019). https://doi.org/10.1007/s00521-018-3477-2

  • DOI: https://doi.org/10.1007/s00521-018-3477-2
