Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015)

  • Original Paper
Language Resources and Evaluation

Abstract

Psycholinguistic text features are a valuable resource for various research topics, since they capture psychological, social, and linguistic aspects of written texts by means of dictionary files. These files are structured in categories, which are defined as groups of dictionary words that tap a particular domain (e.g., negative emotion words). Linguistic Inquiry and Word Count (LIWC) is a widely used and versatile computer-based language analysis tool designed for psycholinguistic text analysis. The most recent version of the default English dictionary is LIWC2015, released with the 2015 version of the LIWC software. The literature has recently introduced the latest Brazilian Portuguese LIWC dictionary (BP-LIWC2015), developed with the same categories as the LIWC2015 English dictionary. However, the literature has also reported the need to evaluate BP-LIWC2015. In this scenario, this work investigates three questions: (i) Since LIWC2015 shows consistent improvements over the English dictionary developed in 2007 (LIWC2007), does BP-LIWC2015 achieve better text classification results than the older Brazilian Portuguese dictionary (BP-LIWC2007)? (ii) How equivalent are BP-LIWC2015 and BP-LIWC2007 to LIWC2015? (iii) Are there significant differences between the Brazilian Portuguese dictionaries? To answer these questions, we first conducted text classification experiments with four datasets and seven classification algorithms to compare the two Brazilian Portuguese LIWC dictionaries reported in the literature (i.e., 2007 and 2015). Second, we used a bilingual Portuguese-English scientific news collection to analyze the correlation between LIWC2015 and the Brazilian Portuguese LIWC dictionaries. The results indicate that BP-LIWC2015 outperforms the older version in Brazilian Portuguese text classification. Finally, we found a stronger correlation between BP-LIWC2015 and the original English dictionary than between BP-LIWC2007 and that dictionary.

Notes

  1. The median was chosen as the measure of central tendency since the values found for most categories followed a non-normal distribution according to the D’Agostino-Pearson test (D’Agostino & Belanger, 1990).

  2. To clarify the procedure, revisit Table 1. After the n texts have been processed, the max, min, and median values of each column (i.e., each category) are found. Thus, considering the five values presented in Table 1, the Function words category has a median equal to 21.73.
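
The per-category summary described in Note 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the category names, the `summarize` helper, and all numbers are hypothetical and are not the values from Table 1. The normality check mentioned in Note 1 is not run here; SciPy's `scipy.stats.normaltest` is one implementation of the D’Agostino-Pearson test that could be applied to each column first.

```python
from statistics import median

# Hypothetical LIWC output: one row per processed text, one column per category.
# These numbers are illustrative only and are NOT the values from Table 1.
rows = [
    {"function": 20.0, "affect": 4.0},
    {"function": 22.5, "affect": 6.5},
    {"function": 18.0, "affect": 3.0},
    {"function": 25.0, "affect": 5.0},
    {"function": 21.0, "affect": 7.5},
]


def summarize(rows):
    """Return {category: (min, max, median)}, computed column by column."""
    summary = {}
    for category in rows[0].keys():
        column = [row[category] for row in rows]
        summary[category] = (min(column), max(column), median(column))
    return summary


summary = summarize(rows)
for category, (lowest, highest, middle) in summary.items():
    print(f"{category}: min={lowest} max={highest} median={middle}")
```

With five texts, the median of a column is simply its third value after sorting, which is why it is robust to the skewed distributions mentioned in Note 1.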

References

  • Aggarwal, C. C. (Ed.). (2011). Social Network Data Analytics. Boston, MA: Springer US. https://doi.org/10.1007/978-1-4419-8462-3.

  • Aires, R. et al. (2004). Which classification algorithm works best with stylistic features of Portuguese in order to classify web texts according to users’ needs? Report. Available at: https://comum.rcaap.pt/handle/10400.26/363?mode=full (Accessed: 24 May 2021).

  • Al-Rfou, R., Perozzi, B., & Skiena, S. (2014). ‘Polyglot: Distributed Word Representations for Multilingual NLP’, arXiv:1307.1662 [cs]. Available at: http://arxiv.org/abs/1307.1662 (Accessed: 10 June 2021).

  • Aziz, W., & Specia, L. (2011). ‘Fully automatic compilation of a Portuguese-English parallel corpus for statistical machine translation’, in Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology. STIL, Cuiabá, Brazil.

  • Balage Filho, P. P., Pardo, T. A. S., & Aluisio, S. M. (2013). ‘An evaluation of the Brazilian Portuguese LIWC Dictionary for sentiment analysis’, Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (STIL), pp. 215–219.

  • Barbosa, A. et al. (2021). ‘The impact of automatic text translation on classification of online discussions for social and cognitive presences’, in LAK21: 11th International Learning Analytics and Knowledge Conference. Irvine CA USA: ACM, pp. 77–87. https://doi.org/10.1145/3448139.3448147.

  • Becker, K., & Tumitan, D. (2013). Introdução à mineração de opiniões: Conceitos, aplicações e desafios. Simpósio brasileiro de banco de dados, 75, 27–52.

  • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798–1828.

  • Bryman, A., & Bell, E. (2015). Business Research Methods. 4th edition. Cambridge, United Kingdom; New York, NY, United States of America: OUP Oxford.

  • Calvo, R. A., & D’Mello, S. (2010). Affect detection: an Interdisciplinary Review of Models, Methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18–37. https://doi.org/10.1109/T-AFFC.2010.1.

  • Cambria, E., et al. (2013). New Avenues in Opinion Mining and sentiment analysis. IEEE Intelligent Systems, 28(2), 15–21. https://doi.org/10.1109/MIS.2013.30.

  • Carvalho, F. et al. (2019). ‘Evaluating the Brazilian Portuguese version of the 2015 LIWC Lexicon with sentiment analysis in social networks’, in Proceedings of the VIII Brazilian Workshop on Social Network Analysis and Mining (BraSNAM). Belém, PA, Brazil: SBC, pp. 24–34. https://doi.org/10.5753/brasnam.2019.6545.

  • Carvalho, F., Santos, G., & Guedes, G. P. (2018). ‘AffectPT-br: an Affective Lexicon based on LIWC 2015’, in 37th International Conference of the Chilean Computer Science Society (SCCC). Santiago, Chile, pp. 1–5. https://doi.org/10.1109/SCCC.2018.8705251.

  • Dandannavar, P. S., Mangalwede, S. R., & Deshpande, S. B. (2020). ‘Emoticons and Their Effects on Sentiment Analysis of Twitter Data’, in Haldorai, A. et al. (eds) EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing. Cham: Springer International Publishing (EAI/Springer Innovations in Communication and Computing), pp. 191–201. https://doi.org/10.1007/978-3-030-19562-5_19.

  • Dudău, D. P., & Sava, F. A. (2020). ‘The development and validation of the Romanian version of Linguistic Inquiry and Word Count 2015 (Ro-LIWC2015)’, Current Psychology. https://doi.org/10.1007/s12144-020-00872-4.

  • Eichstaedt, J. C., Kern, M. L., Yaden, D. B., Schwartz, H. A., Giorgi, S., Park, G., Hagan, C. A., Tobolsky, V. A., Smith, L. K., Buffone, A., & Iwry, J. (2021). Closed- and open-vocabulary approaches to text analysis: a review, quantitative comparison, and recommendations. Psychological Methods, 26(4), 398–427. https://doi.org/10.1037/met0000349.

  • Falaki, H. et al. (2010). ‘Diversity in smartphone usage’, in Proceedings of the 8th international conference on Mobile systems, applications, and services. New York, NY, USA: Association for Computing Machinery (MobiSys ’10), pp. 179–194. https://doi.org/10.1145/1814433.1814453.

  • Fersini, E., Pozzi, F. A., & Messina, E. (2015). ‘Detecting irony and sarcasm in microblogs: The role of expressive signals and ensemble classifiers’, in Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–8. https://doi.org/10.1109/DSAA.2015.7344888.

  • Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46(1), 271–290. https://doi.org/10.1023/A:1012474916001.

  • Fornaciari, T., et al. (2020). Fake opinion detection: how similar are crowdsourced datasets to real data? Language Resources and Evaluation, 54(4), 1019–1058. https://doi.org/10.1007/s10579-020-09486-5.

  • Fukunaga, K. (1990). Introduction to statistical pattern recognition. Academic Press, second edition.

  • Gabrilovich, E., & Markovitch, S. (2004). ‘Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5’, in Proceedings of the twenty-first international conference on Machine learning. New York, NY, USA: Association for Computing Machinery (ICML ’04), p. 41. https://doi.org/10.1145/1015330.1015388.

  • Grimmer, J., & Stewart, B. M. (2013). Text as data: the Promise and Pitfalls of Automatic Content Analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028.

  • Hall, M., et al. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278.

  • Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd edition. Haryana, India; Burlington, MA: Morgan Kaufmann.

  • Hernández Farías, D. I., Ortega-Mendoza, R. M., & Montes-y-Gómez, M. (2019). Exploring the Use of Psycholinguistic Information in author profiling. In J. A. Carrasco-Ochoa, et al. (Eds.), Pattern recognition (pp. 411–421). Cham: Springer International Publishing. Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-030-21077-9_38.

  • Ho, T. K. (1995). ‘Random decision forests’, in Proceedings of 3rd International Conference on Document Analysis and Recognition, pp. 278–282 vol.1. https://doi.org/10.1109/ICDAR.1995.598994.

  • Kleinbaum, D. G., et al. (2002). Logistic regression. New York: Springer-Verlag.

  • Kohavi, R. (1995). ‘A study of cross-validation and bootstrap for accuracy estimation and model selection’, in Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. (IJCAI’95), pp. 1137–1143.

  • Lan, M., et al. (2009). Supervised and traditional term weighting methods for automatic text categorization. IEEE transactions on pattern analysis and machine intelligence, 31(4), 721–735. https://doi.org/10.1109/TPAMI.2008.110.

  • Landwehr, N., Hall, M., & Frank, E. (2005). Logistic model trees. Machine Learning, 59(1), 161–205. https://doi.org/10.1007/s10994-005-0466-3.

  • Langley, P., Iba, W., & Thompson, K. (1992). ‘An analysis of Bayesian classifiers’, in Proceedings of the tenth national conference on Artificial intelligence. San Jose, California: AAAI Press (AAAI’92), pp. 223–228.

  • Läubli, S. (2020). ‘Machine Translation for Professional Translators’. https://doi.org/10.5167/UZH-193466.

  • Liu, B. (2012). Sentiment analysis and opinion mining. Morgan & Claypool Publishers.

  • Liu, B. (2020). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. 2nd edition. Cambridge University Press.

  • McCallum, A., & Nigam, K. (1998). ‘A Comparison of Event Models for Naive Bayes Text Classification’, in Learning for Text Categorization: Papers from the 1998 AAAI Workshop, pp. 41–48. Available at: http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf (Accessed: 25 May 2021).

  • Meier, T. et al. (2019). ‘“LIWC auf Deutsch”: The Development, Psychometrics, and Introduction of DE-LIWC2015’. PsyArXiv. https://doi.org/10.31234/osf.io/uq8zt.

  • Mello, R. F. et al. (2021). ‘Towards Automatic Content Analysis of Rhetorical Structure in Brazilian College Entrance Essays’, in International Conference on Artificial Intelligence in Education. Springer, pp. 162–167.

  • Midhun, M. E., Nair, S. R., Prabhakar, V. N., & Kumar, S. S. (2014). ‘Deep model for classification of hyperspectral image using restricted boltzmann machine’, in Proceedings of the 2014 international conference on interdisciplinary advances in applied computing (pp. 1–7).

  • Noether, G. E. (1981). Why Kendall Tau? Teaching Statistics, 3(2), 41–43. https://doi.org/10.1111/j.1467-9639.1981.tb00422.x.

  • Pennebaker, J. W. (2013). The Secret Life of Pronouns: What Our Words Say About Us. Reprint edition. New York: Bloomsbury Publishing.

  • Pennebaker, J. W. et al. (2015). ‘The Development and Psychometric Properties of LIWC2015’. Available at: https://repositories.lib.utexas.edu/handle/2152/31333 (Accessed: 23 May 2021).

  • Pettijohn, T. F., & Sacco, D. F. (2009). The Language of lyrics: an analysis of Popular Billboard Songs Across Conditions of Social and economic threat. Journal of Language and Social Psychology, 28(3), 297–311. https://doi.org/10.1177/0261927X09335259.

  • Salas-Zárate, M. del P., et al. (2014). A study on LIWC categories for opinion mining in Spanish reviews. Journal of Information Science, 40(6), 749–760. https://doi.org/10.1177/0165551514547842.

  • Platt, J. (1998). ‘Fast Training of Support Vector Machines Using Sequential Minimal Optimization’. Available at: https://www.microsoft.com/en-us/research/publication/fast-training-of-support-vector-machines-using-sequential-minimal-optimization/ (Accessed: 8 June 2021).

  • Pranckevičius, T., & Marcinkevičius, V. (2016, November). Application of logistic regression with part-of-the-speech tagging for multi-class text classification. 2016 IEEE 4th workshop on advances in information, electronic and electrical engineering (AIEEE) (pp. 1–5). IEEE.

  • Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/BF00116251.

  • Quinlan, J. R. (2014). C4.5: programs for machine learning. Elsevier.

  • Rodrigues, R. G. et al. (2017). Inferência de idade utilizando o LIWC: identificando potenciais predadores sexuais. Anais do VI brazilian workshop on Social Network Analysis and Mining. BraSNAM 2017. São Paulo, Brazil: SBC.

  • Rude, S. S., Gortner, E. M., & Pennebaker, J. W. (2004). Language use of depressed and depression-vulnerable college students. Cognition and Emotion, 18(8), 1121–1133. https://doi.org/10.1080/02699930441000030.

  • Santos, R. et al. (2016). ‘Evaluating the importance of Web comments through metrics extraction and opinion mining’, in 2016 35th International Conference of the Chilean Computer Science Society (SCCC), pp. 1–11. https://doi.org/10.1109/SCCC.2016.7836039.

  • Schler, J. et al. (2006). ‘Effects of Age and Gender on Blogging.’, in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. AAAI, pp. 199–205. Available at: http://dblp.uni-trier.de/db/conf/aaaiss/aaaiss2006-3.html#SchlerKAP06 (Accessed: 23 May 2021).

  • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1–47.

  • Sender, G., Carvalho, F., & Guedes, G. (2021). The happy level: a New Approach to measure happiness at work using mixed methods. International Journal of Qualitative Methods, 20, 16094069211002412. https://doi.org/10.1177/16094069211002413.

  • Shibata, D. et al. (2016). ‘Detecting Japanese Patients with Alzheimer’s Disease based on Word Category Frequencies’, in Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP). Osaka, Japan: The COLING 2016 Organizing Committee, pp. 78–85. Available at: https://www.aclweb.org/anthology/W16-4211 (Accessed: 23 May 2021).

  • Silva, M. J., Carvalho, P., & Sarmento, L. (2012). ‘Building a sentiment lexicon for social judgement mining’, in Proceedings of the 10th international conference on Computational Processing of the Portuguese Language. Berlin, Heidelberg: Springer-Verlag (PROPOR’12), pp. 218–228. https://doi.org/10.1007/978-3-642-28885-2_25.

  • Souza, M. et al. (2011). ‘Construction of a Portuguese Opinion Lexicon from multiple resources’, in 8th Brazilian Symposium in Information and Human Language Technology. STIL, Mato Grosso, Brazil.

  • Svetnik, V., et al. (2003). Random Forest: a classification and regression Tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6), 1947–1958. https://doi.org/10.1021/ci034160g.

  • Tang, C., & Guo, L. (2015). Digging for gold with a simple tool: validating text mining in studying electronic word-of-mouth (eWOM) communication. Marketing Letters, 26(1), 67–80. https://doi.org/10.1007/s11002-013-9268-8.

  • Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and Computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676.

  • Wang, S., & Manning, C. D. (2012). ‘Baselines and bigrams: simple, good sentiment and topic classification’, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. USA: Association for Computational Linguistics (ACL ’12), pp. 90–94.

  • van Wissen, L., & Boot, P. (2017). ‘An Electronic Translation of the LIWC Dictionary into Dutch’. Available at: https://pure.knaw.nl/portal/en/publications/an-electronic-translation-of-the-liwc-dictionary-into-dutch (Accessed: 23 May 2021).

  • Yin, Y., et al. (2019). A Lexical Resource-Constrained Topic Model for Word Relatedness. IEEE Access: Practical Innovations, Open Solutions, 7, 55261–55268. https://doi.org/10.1109/ACCESS.2019.2909104.

  • Zhang, Y., Jin, R., & Zhou, Z. H. (2010). Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1), 43–52. https://doi.org/10.1007/s13042-010-0001-0.

Acknowledgements

The authors acknowledge CAPES for financial support. This work was developed with the financial support of the ‘Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)’, Finance Code 001.

Author information

Corresponding author

Correspondence to Gustavo Guedes.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Carvalho, F., Junior, F.P., Ogasawara, E. et al. Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015). Lang Resources & Evaluation 58, 203–222 (2024). https://doi.org/10.1007/s10579-023-09647-2

  • DOI: https://doi.org/10.1007/s10579-023-09647-2
