DOI: 10.1145/2464464.2464483
research-article

Content-based similarity measures of weblog authors

Published: 02 May 2013

Abstract

With recent research interest in the confounding roles of homophily and contagion in studies of social influence, there is a strong need for reliable content-based measures of the similarity between people. In this paper, we investigate the use of text similarity measures as a way of predicting the similarity of prolific weblog authors. We describe a novel method of collecting human judgments of overall similarity between two authors, as well as demographic, political, cultural, religious, values, hobbies/interests, personality, and writing style similarity. We then apply a range of automated textual similarity measures based on word frequency counts, and calculate their statistical correlation with human judgments. Our findings indicate that commonly used text similarity measures do not correlate well with human judgments of author similarity. However, various measures that pay special attention to personal pronouns and their context correlate significantly with different facets of similarity.
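
The abstract does not specify the exact measures used, so the following is only a minimal sketch of the general approach it describes: a word-frequency similarity between two authors, a variant restricted to personal pronouns and the word that follows them, and a correlation against human judgments. The pronoun list, the one-word context window, the use of cosine similarity, and the use of Pearson correlation are all assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter

# Assumed pronoun list for illustration; the abstract does not give one.
PRONOUNS = {"i", "me", "my", "we", "our", "you", "your", "he", "she", "they"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Plain bag-of-words frequency counts for one author's text."""
    return Counter(tokenize(text))


def pronoun_context_counts(text, window=1):
    """Count (pronoun, following word) pairs within `window` tokens.

    This is one hypothetical way to 'pay special attention to personal
    pronouns and their context'.
    """
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in PRONOUNS:
            for ctx in tokens[i + 1 : i + 1 + window]:
                counts[tok + "_" + ctx] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pearson(xs, ys):
    """Pearson correlation between automated scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

To use the sketch, compute `cosine(word_counts(a), word_counts(b))` or `cosine(pronoun_context_counts(a), pronoun_context_counts(b))` for each author pair, then pass the resulting scores and the matching human similarity ratings to `pearson`.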

Cited By

  • (2017) Portuguese personal story analysis and detection in blogs. Proceedings of the International Conference on Web Intelligence, 709-715. DOI: 10.1145/3106426.3106517. Online publication date: 23-Aug-2017.
  • (2017) Affinity Groups: A Linguistic Analysis for Social Network Groups Identification. Social Informatics, 265-276. DOI: 10.1007/978-3-319-67256-4_21. Online publication date: 2-Sep-2017.
  • (2015) Insights on Privacy and Ethics from the Web's Most Prolific Storytellers. Proceedings of the ACM Web Science Conference, 1-10. DOI: 10.1145/2786451.2786474. Online publication date: 28-Jun-2015.
  • (2014) Geographical and organizational distances in enterprise crowdfunding. Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, 778-789. DOI: 10.1145/2531602.2531716. Online publication date: 15-Feb-2014.

    Published In

    WebSci '13: Proceedings of the 5th Annual ACM Web Science Conference
    May 2013
    481 pages
    ISBN:9781450318891
    DOI:10.1145/2464464
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. personal pronouns
    2. similarity measures
    3. weblogs

    Conference

    WebSci '13: Web Science 2013
    May 2 - 4, 2013
    Paris, France

    Acceptance Rates

    Overall Acceptance Rate 245 of 933 submissions, 26%
