Abstract
Psycholinguistic text features are a valuable resource for many research topics because they capture psychological, social, and linguistic aspects of written texts using dictionary files. These files are structured in categories, defined as groups of dictionary words that tap a particular domain (e.g., negative emotion words). Linguistic Inquiry and Word Count (LIWC) is a widely used and versatile computer-based language analysis tool designed for psycholinguistic text analysis. The most recent version of the default English dictionary is LIWC2015, released with the 2015 version of the LIWC software. The literature has recently introduced the latest Brazilian Portuguese LIWC dictionary (BP-LIWC2015), developed with the same categories as the LIWC2015 English dictionary. However, the literature has also reported the need to evaluate BP-LIWC2015. In this scenario, this work investigates three questions: (i) Since LIWC2015 shows consistent improvements over the English dictionary developed in 2007 (LIWC2007), does BP-LIWC2015 achieve better text classification results than the older Brazilian Portuguese dictionary (BP-LIWC2007)? (ii) How closely do BP-LIWC2015 and BP-LIWC2007 each correspond to LIWC2015? (iii) Are there significant differences between the two Brazilian Portuguese dictionaries? To answer these questions, we first conducted text classification experiments with four datasets and seven classification algorithms to compare the two Brazilian Portuguese LIWC dictionaries reported in the literature (i.e., 2007 and 2015). Second, we used a bilingual Portuguese-English scientific news collection to analyze the correlation between LIWC2015 and the Brazilian Portuguese LIWC dictionaries. The results indicate that BP-LIWC2015 outperforms the older version in Brazilian Portuguese text classification. Finally, we found a more significant correlation between BP-LIWC2015 and the original English dictionary than between the older version and the original.
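To make the notion of dictionary categories concrete, the following minimal sketch shows how a LIWC-style feature vector could be derived from a text as the percentage of tokens that fall into each category. It is an illustration only, not the LIWC software or the BP-LIWC2015 dictionary: the dictionary `TOY_DICTIONARY`, its entries, and the helper names `matches` and `category_percentages` are hypothetical, and the tokenization is deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (NOT the real BP-LIWC2015 content).
# Each category maps to its entries; a trailing "*" marks a prefix wildcard.
TOY_DICTIONARY = {
    "negemo": ["triste", "ódio", "ruim", "chor*"],
    "posemo": ["feliz", "amor", "bom", "alegr*"],
}


def matches(entry, token):
    """Return True if a token matches a dictionary entry (exact or prefix)."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry


def category_percentages(text, dictionary):
    """Return, per category, the percentage of tokens matching that category."""
    # Simplistic tokenization: lowercase, then keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, entries in dictionary.items():
            if any(matches(entry, token) for entry in entries):
                counts[category] += 1
    total = len(tokens)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }


features = category_percentages(
    "Estou feliz, mas ele chorou e ficou triste.", TOY_DICTIONARY
)
print(features)
```

A vector of this kind, with one value per category, is the sort of input a text classifier could consume when comparing dictionaries; the actual experimental pipeline is described in the body of the article.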
Notes
The median was chosen as the measure of central tendency since the values found for most categories followed a non-normal distribution according to the D’Agostino-Pearson test (D’Agostino & Belanger, 1990).
Acknowledgements
The authors acknowledge CAPES for financial support. This work was developed with the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001.
About this article
Cite this article
Carvalho, F., Junior, F.P., Ogasawara, E. et al. Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015). Lang Resources & Evaluation 58, 203–222 (2024). https://doi.org/10.1007/s10579-023-09647-2
DOI: https://doi.org/10.1007/s10579-023-09647-2