Abstract
Massive volumes of user-generated social media text pose new opportunities as well as challenges for psycholinguistic analysis aimed at understanding individual differences such as personality. Traditional off-the-shelf natural language processing (NLP) tools perform poorly on social media text because of its frequent use of out-of-vocabulary (OOV) words. The existing dictionary-based, closed-vocabulary approach and the data-driven, open-vocabulary approach are both limited in handling OOV words. This research designs an OOV-aware curation process with a specific focus on OOV words. Following a design science research process model, the curation process is developed through four design cycles. It includes a hybrid approach that integrates a closed-vocabulary dictionary with expanded OOV categories and an open-vocabulary approach with normalized OOV words. The curation process would enable psycholinguistic researchers and practitioners to exploit more psycholinguistic cues for tasks similar to personality prediction using social media text.
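As a rough illustration of the hybrid idea described in the abstract, the following minimal Python sketch combines closed-vocabulary category counts (including an added OOV category) with open-vocabulary counts over normalized tokens. It is not the authors' implementation: the lexicons, the category names, the normalization map, and the function names are all hypothetical toy assumptions chosen only to make the two sides of the approach concrete.

```python
import re
from collections import Counter

# Hypothetical toy dictionary for the closed-vocabulary side.
CATEGORY_LEXICON = {
    "posemo": {"happy", "love", "great"},
    "negemo": {"sad", "hate"},
}
# Hypothetical expanded OOV category added to the dictionary.
OOV_CATEGORIES = {
    "laughter": {"lol", "lmao", "haha"},
}
# Hypothetical normalization map for the open-vocabulary side.
NORMALIZATION = {"u": "you", "gr8": "great", "luv": "love"}

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def hybrid_features(text):
    """Return (closed-vocabulary category counts, open-vocabulary token counts)."""
    tokens = tokenize(text)
    # Map known OOV spellings to standard forms; leave other tokens unchanged.
    normalized = [NORMALIZATION.get(t, t) for t in tokens]

    closed = Counter()
    for raw, norm in zip(tokens, normalized):
        # OOV categories are matched against the raw token.
        for category, words in OOV_CATEGORIES.items():
            if raw in words:
                closed[category] += 1
        # Standard categories are matched against the normalized token.
        for category, words in CATEGORY_LEXICON.items():
            if norm in words:
                closed[category] += 1

    # Open-vocabulary features: counts over normalized tokens.
    open_vocab = Counter(normalized)
    return closed, open_vocab


closed, open_vocab = hybrid_features("lol u r gr8, luv this")
```

In this toy setup, normalizing before the dictionary lookup lets a spelling such as "gr8" contribute to an existing category, while a token such as "lol" is captured by the added OOV category rather than being discarded.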
Notes
- 1.
- 2.
spaCy https://spacy.io/; happierfuntokenizing https://github.com/dlatk/happierfuntokenizing; ekphrasis https://github.com/cbaziotis/ekphrasis [35].
References
John, O.P., Angleitner, A., Ostendorf, F.: The lexical approach to personality: a historical review of trait taxonomic research. Eur. J. Pers. 2, 171–203 (1988). https://doi.org/10.1002/per.2410020302
Pennebaker, J.W., King, L.A.: Linguistic styles: language use as an individual difference. J. Pers. Soc. Psychol. 77, 1296–1312 (1999). https://doi.org/10.1037/0022-3514.77.6.1296
Krippendorff, K.: Content Analysis. https://us.sagepub.com/en-us/nam/content-analysis/book258450. Accessed 01 Dec 2019
Kern, M.L., et al.: Gaining insights from social media language: Methodologies and challenges. Psychol. Methods 21, 507–525 (2016). https://doi.org/10.1037/met0000091
Youyou, W., Kosinski, M., Stillwell, D.: Computer-based personality judgments are more accurate than those made by humans. PNAS 112, 1036–1040 (2015). https://doi.org/10.1073/pnas.1418680112
Lambiotte, R., Kosinski, M.: Tracking the digital footprints of personality. Proc. IEEE 102, 1934–1939 (2014). https://doi.org/10.1109/JPROC.2014.2359054
Mairesse, F., Walker, M.A., Mehl, M.R., Moore, R.K.: Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Int. Res. 30, 457–500 (2007)
Golbeck, J., Robles, C., Turner, K.: Predicting personality with social media. In: Extended Abstracts on Human Factors in Computing Systems, CHI 2011, pp. 253–262. ACM, New York (2011). https://doi.org/10.1145/1979742.1979614
Fast, E., Chen, B., Bernstein, M.S.: Empath: understanding topic signals in large-scale text. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4647–4657. ACM, New York (2016). https://doi.org/10.1145/2858036.2858535
Sarker, A.: A customizable pipeline for social media text normalization. Soc. Netw. Anal. Min. 7(1), 1–13 (2017). https://doi.org/10.1007/s13278-017-0464-z
Yarkoni, T.: Personality in 100,000 words: a large-scale analysis of personality and word use among bloggers. J. Res. Pers. 44, 363–373 (2010). https://doi.org/10.1016/j.jrp.2010.04.001
Schwartz, H.A., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8, e73791 (2013). https://doi.org/10.1371/journal.pone.0073791
Han, B., Cook, P., Baldwin, T.: Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol. 4, 5:1–5:27 (2013). https://doi.org/10.1145/2414425.2414430
Azucar, D., Marengo, D., Settanni, M.: Predicting the Big 5 personality traits from digital footprints on social media: a meta-analysis. Pers. Individ. Differ. 124, 150–159 (2018). https://doi.org/10.1016/j.paid.2017.12.018
Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M.A., Maynard, D., Aswani, N.: TwitIE: an open-source information extraction pipeline for microblog text. In: RANLP (2013)
Kramer, A.D.I., Rodden, K.: Word usage and posting behaviors: modeling blogs with unobtrusive data collection methods. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1125–1128. ACM, New York (2008). https://doi.org/10.1145/1357054.1357230
Han, B., Baldwin, T.: Lexical normalisation of short text messages: makn sens a #twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (2011)
Arnoux, P.-H., Xu, A., Boyette, N., Mahmud, J., Akkiraju, R., Sinha, V.: 25 Tweets to Know You: A New Model to Predict Personality with Social Media. arXiv:1704.05513 [cs] (2017)
Oberlander, J., Nowson, S.: Whose thumb is it anyway? Classifying author personality from weblog text. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 627–634. Association for Computational Linguistics, Stroudsburg (2006)
Contractor, D., Faruquie, T.A., Subramaniam, L.V.: Unsupervised cleansing of noisy text. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 189–196. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)
Liu, F., Weng, F., Wang, B., Liu, Y.: Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, vol. 2, pp. 71–76. Association for Computational Linguistics (2011)
Schwartz, H.A., Ungar, L.H.: Data-driven content analysis of social media: a systematic overview of automated methods. Ann. Am. Acad. Polit. Soc. Sci. 659, 78–94 (2015). https://doi.org/10.1177/0002716215569197
Farnadi, G., et al.: Computational personality recognition in social media. User Model. User-Adap. Interact. 26, 109–142 (2016). https://doi.org/10.1007/s11257-016-9171-0
Settanni, M., Marengo, D.: Sharing feelings online: studying emotional well-being via automated text analysis of Facebook posts. Front. Psychol. 6 (2015). https://doi.org/10.3389/fpsyg.2015.01045
Iacobelli, F., Gill, A.J., Nowson, S., Oberlander, J.: Large scale personality classification of bloggers. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6975, pp. 568–577. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24571-8_71
Schwartz, H.A., et al.: Toward personality insights from language exploration in social media. In: 2013 AAAI Spring Symposium Series (2013)
Vaishnavi, V.K., Kuechler Jr., W.: Design Science Research Methods and Patterns: Innovating Information and Communication Technology. Auerbach Publications, USA (2007)
Funder, D.C.: Accurate personality judgment. Curr. Dir. Psychol. Sci. 21, 177–182 (2012). https://doi.org/10.1177/0963721412445309
Goldberg, L.R.: An alternative “description of personality”: The Big-Five factor structure. J. Pers. Soc. Psychol. 59, 1216–1229 (1990). https://doi.org/10.1037/0022-3514.59.6.1216
Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. PNAS 110, 5802–5805 (2013). https://doi.org/10.1073/pnas.1218772110
Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 80–88. Association for Computational Linguistics, Los Angeles (2010)
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Association for Computational Linguistics, Edinburgh (2011)
España-Bonet, C., Costa-jussà, M.R.: Hybrid machine translation overview. In: Costa-jussà, M.R., Rapp, R., Lambert, P., Eberle, K., Banchs, R.E., Babych, B. (eds.) Hybrid Approaches to Machine Translation. TANLP, pp. 1–24. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21311-8_1
Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., Chen, Y.: Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. In: Proceedings of the Third Workshop on Statistical Machine Translation (2008). https://doi.org/10.3115/1626394.1626422
Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Association for Computational Linguistics, Vancouver (2017). https://doi.org/10.18653/v1/S17-2126
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, K., Li, Y. (2020). An OOV-Aware Curation Process for Psycholinguistic Analysis of Social Media Text - A Hybrid Approach. In: Hofmann, S., Müller, O., Rossi, M. (eds) Designing for Digital Transformation. Co-Creating Services with Citizens and Industry. DESRIST 2020. Lecture Notes in Computer Science(), vol 12388. Springer, Cham. https://doi.org/10.1007/978-3-030-64823-7_11
DOI: https://doi.org/10.1007/978-3-030-64823-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64822-0
Online ISBN: 978-3-030-64823-7
eBook Packages: Computer Science, Computer Science (R0)