Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015)

Carvalho, Flavio; Junior, Fabio Paschoal; Ogasawara, Eduardo; Ferrari, Lilian; Guedes, Gustavo

doi:10.1007/s10579-023-09647-2

Original Paper
Published: 03 May 2023

Volume 58, pages 203–222, (2024)

Language Resources and Evaluation


Abstract

Psycholinguistic text features are a valuable resource for many research topics, since they capture psychological, social, and linguistic aspects of written texts using dictionary files. These files are structured in categories, defined as groups of dictionary words that tap a particular domain (e.g., negative emotion words). Linguistic Inquiry and Word Count (LIWC) is a widely used and versatile computer-based language analysis tool designed for psycholinguistic text analysis. The most recent version of the default English dictionary is LIWC2015, released with the 2015 version of the LIWC software. The literature has recently introduced the latest Brazilian Portuguese LIWC dictionary (BP-LIWC2015), developed with the same categories as the LIWC2015 English dictionary. However, the literature has also reported the need to evaluate BP-LIWC2015. In this scenario, this work investigates three questions: (i) Since LIWC2015 shows consistent improvements over the English dictionary developed in 2007 (LIWC2007), does BP-LIWC2015 achieve better text classification results than the older Brazilian Portuguese dictionary (BP-LIWC2007)? (ii) How equivalent are BP-LIWC2015 and BP-LIWC2007 to LIWC2015? (iii) Are there significant differences between the Brazilian Portuguese dictionaries? To answer these questions, we first conducted text classification experiments with four datasets and seven classification algorithms to compare the two Brazilian Portuguese LIWC dictionaries reported in the literature (i.e., 2007 and 2015). Second, we used a bilingual Portuguese-English scientific news collection to analyze the correlation between LIWC2015 and the Brazilian Portuguese LIWC dictionaries. The results indicate that BP-LIWC2015 outperforms the older version in Brazilian Portuguese text classification. Finally, we found a stronger correlation between BP-LIWC2015 and the original English dictionary than between the older version and the English dictionary.
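
To make the idea of category-structured dictionary files concrete, the following is a minimal sketch of dictionary-based scoring: each category is reported as the percentage of a text's tokens that match its word list. The toy dictionary, the category names, the tokenizer, and the trailing-asterisk prefix convention are all assumptions made for illustration; they are not taken from the BP-LIWC2015 files or from the LIWC software.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (category -> entries), NOT real BP-LIWC2015 content.
# Assumption of this sketch: an entry ending in "*" matches any token with that prefix.
TOY_DICTIONARY = {
    "negemo": ["triste", "ódio", "medo*"],
    "posemo": ["feliz", "amor", "alegr*"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches(entry, token):
    """Return True if the token matches a dictionary entry (exact or prefix)."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry


def category_percentages(text, dictionary):
    """Percentage of tokens in `text` that fall into each category."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for category, entries in dictionary.items():
            if any(matches(entry, token) for entry in entries):
                counts[category] += 1
    # Guard against empty texts to avoid division by zero.
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }


if __name__ == "__main__":
    print(category_percentages("Estou feliz mas com medo", TOY_DICTIONARY))
```

Vectors of category percentages like these could then serve as features for text classification, or be compared across dictionaries on parallel texts.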

Notes

The median was chosen as the summary measure since the values found for most categories followed a non-normal distribution according to the D’Agostino-Pearson test (D’Agostino & Belanger, 1990).
To clarify the procedure, revisit Table 1. After the n texts have been processed, the max, min, and median values of each column (i.e., category) are found. Thus, considering the five values presented in Table 1, the Function words category has a median equal to 21.73.
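
The per-category summary described in this note can be sketched as follows. The rows and category names are hypothetical placeholders and do not reproduce the values in Table 1; the function `summarize_columns` is likewise a name introduced only for this illustration.

```python
from statistics import median

# Hypothetical per-text category scores (one dict per processed text).
# These numbers are illustrative only and are NOT the values from Table 1.
rows = [
    {"function": 20.0, "negemo": 1.5},
    {"function": 25.0, "negemo": 0.0},
    {"function": 22.5, "negemo": 3.0},
]


def summarize_columns(rows):
    """Compute min, max, and median for each column (i.e., category)."""
    summary = {}
    for category in rows[0].keys():
        values = [row[category] for row in rows]
        summary[category] = {
            "min": min(values),
            "max": max(values),
            "median": median(values),
        }
    return summary


if __name__ == "__main__":
    for category, stats in summarize_columns(rows).items():
        print(category, stats)
```

If SciPy is available, `scipy.stats.normaltest` provides an omnibus normality test based on D'Agostino and Pearson's approach, which could be applied to each column before choosing the median as the summary statistic.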


Acknowledgements

The authors acknowledge CAPES for financial support. This work was developed with the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001.

Author information

Authors and Affiliations

CEFET/RJ, Rio de Janeiro, Brazil
Flavio Carvalho, Fabio Paschoal Junior, Eduardo Ogasawara & Gustavo Guedes
UFRJ, Rio de Janeiro, Brazil
Lilian Ferrari


Corresponding author

Correspondence to Gustavo Guedes.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Carvalho, F., Junior, F.P., Ogasawara, E. et al. Evaluation of the Brazilian Portuguese version of linguistic inquiry and word count 2015 (BP-LIWC2015). Lang Resources & Evaluation 58, 203–222 (2024). https://doi.org/10.1007/s10579-023-09647-2


Accepted: 14 February 2023
Published: 03 May 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s10579-023-09647-2
