An OOV-Aware Curation Process for Psycholinguistic Analysis of Social Media Text - A Hybrid Approach

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12388)

Abstract

Massive user-generated social media text poses new opportunities as well as challenges for psycholinguistic analysis aimed at understanding individual differences such as personality. Traditional off-the-shelf NLP (Natural Language Processing) tools perform poorly on social media text because of its frequent use of out-of-vocabulary (OOV) words. Existing dictionary-based, closed-vocabulary approaches and data-driven, open-vocabulary approaches are both limited in handling OOV words. This research designs an OOV-aware curation process with a specific focus on OOV words. Following a design science research process model, the curation process is designed through four design cycles. It includes a hybrid approach that integrates a closed-vocabulary dictionary with expanded OOV categories and an open-vocabulary approach with normalized OOV words. The curation process would enable psycholinguistic researchers and practitioners to exploit more psycholinguistic cues for tasks such as personality prediction from social media text.
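
As a rough illustration of what such a hybrid approach could look like, the sketch below combines closed-vocabulary category counts (including extra OOV categories) with open-vocabulary token counts over normalized text. It is only one possible reading of the abstract, not the authors' implementation: the word lists, the category labels, the whitespace tokenizer, and the function name `curate` are all hypothetical and chosen for this example.

```python
from collections import Counter

# Hypothetical toy resources for illustration; none of these come from the paper.
IN_VOCAB = {"i", "am", "so", "happy", "today", "you", "are", "great", "love", "this"}
NORMALIZATION = {"u": "you", "r": "are", "gr8": "great", "luv": "love"}
CATEGORY_DICT = {"happy": "posemo", "great": "posemo", "love": "posemo",
                 "i": "pronoun", "you": "pronoun"}
OOV_CATEGORIES = {":)": "emoticon", "lol": "netspeak"}


def curate(text):
    """Return (closed-vocabulary category counts, open-vocabulary token counts)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    closed = Counter()
    normalized = []
    for tok in tokens:
        if tok in IN_VOCAB:
            normalized.append(tok)
        elif tok in NORMALIZATION:
            # OOV word with a known standard form: normalize it.
            normalized.append(NORMALIZATION[tok])
        else:
            # OOV word without a standard form: keep it as-is and,
            # if it belongs to an expanded OOV category, count that category.
            if tok in OOV_CATEGORIES:
                closed[OOV_CATEGORIES[tok]] += 1
            normalized.append(tok)
    # Closed-vocabulary features: dictionary categories over normalized tokens.
    for tok in normalized:
        if tok in CATEGORY_DICT:
            closed[CATEGORY_DICT[tok]] += 1
    # Open-vocabulary features: unigram counts over normalized tokens.
    open_vocab = Counter(normalized)
    return closed, open_vocab


closed_counts, open_counts = curate("u r gr8 lol :)")
print(dict(closed_counts))
print(dict(open_counts))
```

In this sketch, normalizing before the dictionary lookup lets words such as "gr8" contribute to an existing category, while tokens such as "lol" that have no standard form are still captured through the added OOV categories.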

Notes

  1. http://mypersonality.org/

  2. spaCy https://spacy.io/; happierfuntokenizing https://github.com/dlatk/happierfuntokenizing; ekphrasis https://github.com/cbaziotis/ekphrasis [35].

References

  1. John, O.P., Angleitner, A., Ostendorf, F.: The lexical approach to personality: a historical review of trait taxonomic research. Eur. J. Pers. 2, 171–203 (1988). https://doi.org/10.1002/per.2410020302

  2. Pennebaker, J.W., King, L.A.: Linguistic styles: language use as an individual difference. J. Pers. Soc. Psychol. 77, 1296–1312 (1999). https://doi.org/10.1037/0022-3514.77.6.1296

  3. Krippendorff, K.: Content Analysis. https://us.sagepub.com/en-us/nam/content-analysis/book258450. Accessed 01 Dec 2019

  4. Kern, M.L., et al.: Gaining insights from social media language: Methodologies and challenges. Psychol. Methods 21, 507–525 (2016). https://doi.org/10.1037/met0000091

  5. Youyou, W., Kosinski, M., Stillwell, D.: Computer-based personality judgments are more accurate than those made by humans. PNAS 112, 1036–1040 (2015). https://doi.org/10.1073/pnas.1418680112

  6. Lambiotte, R., Kosinski, M.: Tracking the digital footprints of personality. Proc. IEEE 102, 1934–1939 (2014). https://doi.org/10.1109/JPROC.2014.2359054

  7. Mairesse, F., Walker, M.A., Mehl, M.R., Moore, R.K.: Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Int. Res. 30, 457–500 (2007)

  8. Golbeck, J., Robles, C., Turner, K.: Predicting personality with social media. In: Extended Abstracts on Human Factors in Computing Systems, CHI 2011, pp. 253–262. ACM, New York (2011). https://doi.org/10.1145/1979742.1979614

  9. Fast, E., Chen, B., Bernstein, M.S.: Empath: understanding topic signals in large-scale text. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4647–4657. ACM, New York (2016). https://doi.org/10.1145/2858036.2858535

  10. Sarker, A.: A customizable pipeline for social media text normalization. Soc. Netw. Anal. Min. 7(1), 1–13 (2017). https://doi.org/10.1007/s13278-017-0464-z

  11. Yarkoni, T.: Personality in 100,000 words: a large-scale analysis of personality and word use among bloggers. J. Res. Pers. 44, 363–373 (2010). https://doi.org/10.1016/j.jrp.2010.04.001

  12. Schwartz, H.A., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8, e73791 (2013). https://doi.org/10.1371/journal.pone.0073791

  13. Han, B., Cook, P., Baldwin, T.: Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol. 4, 5:1–5:27 (2013). https://doi.org/10.1145/2414425.2414430

  14. Azucar, D., Marengo, D., Settanni, M.: Predicting the Big 5 personality traits from digital footprints on social media: a meta-analysis. Pers. Individ. Differ. 124, 150–159 (2018). https://doi.org/10.1016/j.paid.2017.12.018

  15. Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M.A., Maynard, D., Aswani, N.: TwitIE: an open-source information extraction pipeline for microblog text. In: RANLP (2013)

  16. Kramer, A.D.I., Rodden, K.: Word usage and posting behaviors: modeling blogs with unobtrusive data collection methods. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1125–1128. ACM, New York (2008). https://doi.org/10.1145/1357054.1357230

  17. Han, B., Baldwin, T.: Lexical normalisation of short text messages: makn sens a #twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (2011)

  18. Arnoux, P.-H., Xu, A., Boyette, N., Mahmud, J., Akkiraju, R., Sinha, V.: 25 Tweets to Know You: A New Model to Predict Personality with Social Media. arXiv:1704.05513 [cs] (2017)

  19. Oberlander, J., Nowson, S.: Whose thumb is it anyway? Classifying author personality from weblog text. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 627–634. Association for Computational Linguistics, Stroudsburg (2006)

  20. Contractor, D., Faruquie, T.A., Subramaniam, L.V.: Unsupervised cleansing of noisy text. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 189–196. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)

  21. Liu, F., Weng, F., Wang, B., Liu, Y.: Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, vol. 2, pp. 71–76. Association for Computational Linguistics (2011)

  22. Schwartz, H.A., Ungar, L.H.: Data-driven content analysis of social media: a systematic overview of automated methods. Ann. Am. Acad. Polit. Soc. Sci. 659, 78–94 (2015). https://doi.org/10.1177/0002716215569197

  23. Farnadi, G., et al.: Computational personality recognition in social media. User Model. User-Adap. Interact. 26, 109–142 (2016). https://doi.org/10.1007/s11257-016-9171-0

  24. Settanni, M., Marengo, D.: Sharing feelings online: studying emotional well-being via automated text analysis of Facebook posts. Front. Psychol. 6 (2015). https://doi.org/10.3389/fpsyg.2015.01045

  25. Iacobelli, F., Gill, A.J., Nowson, S., Oberlander, J.: Large scale personality classification of bloggers. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6975, pp. 568–577. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24571-8_71

  26. Schwartz, H.A., et al.: Toward personality insights from language exploration in social media. In: 2013 AAAI Spring Symposium Series (2013)

  27. Vaishnavi, V.K., Kuechler Jr., W.: Design Science Research Methods and Patterns: Innovating Information and Communication Technology. Auerbach Publications, USA (2007)

  28. Funder, D.C.: Accurate personality judgment. Curr. Dir. Psychol. Sci. 21, 177–182 (2012). https://doi.org/10.1177/0963721412445309

  29. Goldberg, L.R.: An alternative “description of personality”: The Big-Five factor structure. J. Pers. Soc. Psychol. 59, 1216–1229 (1990). https://doi.org/10.1037/0022-3514.59.6.1216

  30. Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. PNAS 110, 5802–5805 (2013). https://doi.org/10.1073/pnas.1218772110

  31. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 80–88. Association for Computational Linguistics, Los Angeles (2010)

  32. Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Association for Computational Linguistics, Edinburgh (2011)

  33. España-Bonet, C., Costa-jussà, M.R.: Hybrid machine translation overview. In: Costa-jussà, M.R., Rapp, R., Lambert, P., Eberle, K., Banchs, R.E., Babych, B. (eds.) Hybrid Approaches to Machine Translation. TANLP, pp. 1–24. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21311-8_1

  34. Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., Chen, Y.: Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. In: Proceedings of the Third Workshop on Statistical Machine Translation (2008). https://doi.org/10.3115/1626394.1626422

  35. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Association for Computational Linguistics, Vancouver (2017). https://doi.org/10.18653/v1/S17-2126

Author information

Corresponding author

Correspondence to Kun Liu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, K., Li, Y. (2020). An OOV-Aware Curation Process for Psycholinguistic Analysis of Social Media Text - A Hybrid Approach. In: Hofmann, S., Müller, O., Rossi, M. (eds) Designing for Digital Transformation. Co-Creating Services with Citizens and Industry. DESRIST 2020. Lecture Notes in Computer Science, vol 12388. Springer, Cham. https://doi.org/10.1007/978-3-030-64823-7_11

  • DOI: https://doi.org/10.1007/978-3-030-64823-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64822-0

  • Online ISBN: 978-3-030-64823-7

  • eBook Packages: Computer Science, Computer Science (R0)
