Abstract
Social network users may wish to preserve their anonymity online by masking their identity and not using language associated with any particular demographics or personality. However, they have no control over the language in incoming communications. We show that linguistic cues in public comments directed at a user are sufficient for an accurate inference of that user’s gender, age, religion, diet, and even personality traits. Moreover, we show that directed communication is even more predictive of a user’s profile than the user’s own language. We then conduct a nuanced analysis of what types of social relationships are most predictive of users’ attributes, and propose new strategies on how individuals can modulate their online social relationships and incoming communications to preserve their anonymity.
Keywords
- Incoming Communications
- Top Departments
- Demographic Inference
- Linguistic Inquiry And Word Count (LIWC)
- Demographic Attributes
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Notes
- 1.
- 2. We note that other diets are possible, such as kosher or halal; however, these are closely related to religion, which we also study, so we intentionally exclude them.
- 3. Additional queries were formed for Sikhism and Jainism which did not return sufficient numbers of English speaking individuals to be included.
- 4. Due to Twitter API rate limits, full edge information was gathered only for 1.7M pairs.
- 5. One possibility for testing this hypothesis in future work is to identify a cohort of individuals who publicly signal these variables in an explicit way (e.g., including religious imagery in their profile picture) and then test for effects of tie strength on their peers’ predictiveness.
- 6. This risk is valid even if the individual themselves does not engage with others, as platforms such as Twitter allow anyone to directly message another unless banned.
- 7. We note that while we measure topical difference using our LDA model for messages, the peers selected by maximizing topical difference would be easily identified as such by the layperson (e.g., a peer discussing completely different topics).
References
Al Zamal, F., Liu, W., Ruths, D.: Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: Proceedings of ICWSM (2012)
Almishari, M., Oguz, E., Tsudik, G.: Fighting authorship linkability with crowdsourcing. In: Proceedings of COSN, pp. 69–82. ACM (2014)
Altenburger, K.M., Ugander, J.: Bias and variance in the social structure of gender. arXiv preprint arXiv:1705.04774 (2017)
Anderson, C., John, O.P., Keltner, D., Kring, A.M.: Who attains social status? effects of personality and physical attractiveness in social groups. J. Pers. Soc. Psychol. 81(1), 116 (2001)
Baker, W., Bowie, D.: Religious affiliation as a correlate of linguistic behavior. Univ. Pennsylvania Work. Pap. Linguist. 15(2), 2 (2010)
Bamman, D., Eisenstein, J., Schnoebelen, T.: Gender identity and lexical variation in social media. J. Sociolinguist. 18(2), 135–160 (2014)
Barbieri, F.: Patterns of age-based linguistic variation in American English. J. Sociolinguist. 12(1), 58–88 (2008)
Beller, C., Knowles, R., Harman, C., Bergsma, S., Mitchell, M., Van Durme, B.: I’m a belieber: social roles via self-identification and conceptual attributes. In: Proceedings of ACL, pp. 181–186 (2014)
Benton, A., Mitchell, M., Hovy, D.: Multitask learning for mental health conditions with limited social media data. In: Proceedings of EACL (2017)
Bergsma, S., Van Durme, B.: Using conceptual class attributes to characterize social media users. In: Proceedings of ACL (2013)
Best, P., Manktelow, R., Taylor, B.: Online communication, social media and adolescent wellbeing: a systematic narrative review. Child Youth Serv. Rev. 41, 27–36 (2014)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. (JMLR) 3, 993–1022 (2003)
Bogardus, E.S.: A social distance scale. Sociol. Soc. Res. 17, 265–271 (1933)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Brennan, M., Afroz, S., Greenstadt, R.: Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. (TISSEC) 15(3), 12 (2012)
Brysbaert, M., Warriner, A.B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 46(3), 904–911 (2014)
Bucholtz, M., Hall, K.: Identity and interaction: a sociocultural linguistic approach. Discourse Stud. 7(4–5), 585–614 (2005)
Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on Twitter. In: Proceedings of EMNLP, pp. 1301–1309 (2011)
Carpenter, J., Preotiuc-Pietro, D., Flekova, L., Giorgi, S., Hagan, C., Kern, M.L., Buffone, A.E., Ungar, L., Seligman, M.E.: Real men don’t say “cute”: using automatic language analysis to isolate inaccurate aspects of stereotypes. Soc. Psychol. Pers. Sci. 8, 310–322 (2016)
Cesare, N., Grant, C., Nsoesie, E.O.: Detection of user demographics on social media: a review of methods and recommendations for best practices. arXiv preprint arXiv:1702.01807 (2017)
Chen, L., Weber, I., Okulicz-Kozaryn, A.: U.S. religious landscape on Twitter. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014. LNCS, vol. 8851, pp. 544–560. Springer, Cham (2014). doi:10.1007/978-3-319-13734-6_38
Chen, X., Wang, Y., Agichtein, E., Wang, F.: A comparative study of demographic attribute inference in Twitter. In: Proceedings of ICWSM, vol. 15, pp. 590–593 (2015)
Ciot, M., Sonderegger, M., Ruths, D.: Gender inference of Twitter users in non-English contexts. In: Proceedings of EMNLP, pp. 1136–1145 (2013)
Coates, J.: Language and Gender: A Reader. Wiley-Blackwell, Oxford (1998)
Coates, J.: Women, Men and Language: A Sociolinguistic Account of Gender Differences in Language. Routledge, Abingdon (2015)
Danescu-Niculescu-Mizil, C., Gamon, M., Dumais, S.: Mark my words!: linguistic style accommodation in social media. In: Proceedings of WWW, pp. 745–754. ACM (2011)
De Choudhury, M., De, S.: Mental health discourse on reddit: self-disclosure, social support, and anonymity. In: Proceedings of ICWSM (2014)
De Choudhury, M., Kiciman, E.: The language of social support in social media and its effect on suicidal ideation risk. In: Proceedings of ICWSM, pp. 32–41 (2017)
Derlega, V.J., Harris, M.S., Chaikin, A.L.: Self-disclosure reciprocity, liking and the deviant. J. Exp. Soc. Psychol. 9(4), 277–284 (1973)
Dewaele, J.M.: Individual differences in the use of colloquial vocabulary: the effects of sociobiographical and psychological factors. In: Learning Vocabulary in a Second Language: Selection, Acquisition and Testing, pp. 127–153 (2004)
Duggan, M.: Mobile messaging and social media 2015. Pew Res. Center, 13 (2015)
Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). doi:10.1007/978-3-540-79228-4_1
Eagly, A.H., Mladinic, A.: Gender stereotypes and attitudes toward women and men. Pers. Soc. Psychol. Bull. 15(4), 543–558 (1989)
Eckert, P.: Jocks and Burnouts: Social Categories and Identity in the High School. Teachers College Press, New York (1989)
Eckert, P.: Age as a sociolinguistic variable. In: The Handbook of Sociolinguistics, pp. 151–167 (1997)
Eckert, P.: Variation and the indexical field. J. Sociolinguist. 12(4), 453–476 (2008)
Eckert, P., McConnell-Ginet, S.: Language and Gender. Cambridge University Press, New York (2003)
El-Arini, K., Paquet, U., Herbrich, R., Van Gael, J., Agüera y Arcas, B.: Transparent user models for personalization. In: Proceedings of KDD, pp. 678–686. ACM (2012)
Elgin, B., Robison, P.: How despots use Twitter to hunt dissidents. BloombergBusinessweek (2016). https://www.bloomberg.com/news/articles/2016-10-27/twitter-s-firehose-of-tweets-is-incredibly-valuable-and-just-as-dangerous
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15(1), 3133–3181 (2014)
Flekova, L., Gurevych, I.: Can we hide in the web? large scale simultaneous age and gender author profiling in social media. In: Proceedings of CLEF (2013)
Friedkin, N.: A test of structural features of Granovetter’s strength of weak ties theory. Soc. Netw. 2(4), 411–422 (1980)
Garimella, A., Mihalcea, R.: Zooming in on gender differences in social media. In: Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pp. 1–10 (2016)
Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Proceedings of CHI, pp. 211–220. ACM (2009)
Golbeck, J., Robles, C., Edmondson, M., Turner, K.: Predicting personality from Twitter. In: Proceedings of SocialCom, pp. 149–156. IEEE (2011)
Goldin, C., Rouse, C.: Orchestrating impartiality: the impact of “blind” auditions on female musicians. Technical report, National Bureau of Economic Research (1997)
Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers’ age and gender. In: Proceedings of ICWSM (2009)
Granovetter, M.S.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973)
Hovy, D., Søgaard, A.: Tagging performance correlates with author age. In: Proceedings of ACL, pp. 483–488 (2015)
Hovy, D., Spruit, S.L.: The social impact of natural language processing. In: Proceedings of ACL, vol. 2, pp. 591–598 (2016)
John, O.P., Srivastava, S.: The big five trait taxonomy: history, measurement, and theoretical perspectives. In: Handbook of Personality: Theory and Research, vol. 2, pp. 102–138 (1999)
Kendall, S., Tannen, D., et al.: Gender and language in the workplace. In: Gender and Discourse, pp. 81–105. Sage, London (1997)
Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. Proc. Nat. Acad. Sci. (PNAS) 110(15), 5802–5805 (2013)
Krackhardt, D., Nohria, N., Eccles, B.: The strength of strong ties. Netw. Knowl. Econ., 82 (2003)
Labov, W.: Sociolinguistic Patterns. University of Pennsylvania Press, Philadelphia (1972)
Lakoff, R.T., Bucholtz, M.: Language and Woman’s Place: Text and Commentaries, vol. 3. Oxford University Press, USA (2004)
Lea, M., Spears, R., de Groot, D.: Knowing me, knowing you: anonymity effects on social identity processes within groups. Pers. Soc. Psychol. Bull. 27(5), 526–537 (2001)
Lin, N., Ensel, W.M., Vaughn, J.C.: Social resources and strength of ties: structural factors in occupational status attainment. Am. Sociol. Rev., 393–405 (1981)
Liviatan, I., Trope, Y., Liberman, N.: Interpersonal similarity as a social distance dimension: implications for perception of others’ actions. J. Exp. Soc. Psychol. 44(5), 1256–1269 (2008)
Lu, X., Ai, W., Liu, X., Li, Q., Wang, N., Huang, G., Mei, Q.: Learning from the ubiquitous language: an empirical analysis of emoji usage of smartphone users. In: Proceedings of Ubicomp, pp. 770–780. ACM (2016)
Mairesse, F., Walker, M.A., Mehl, M.R., Moore, R.K.: Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Intell. Res. (JAIR) 30, 457–500 (2007)
Marder, B., Joinson, A., Shankar, A., Thirlaway, K.: Strength matters: self-presentation to the strongest audience rather than lowest common denominator when faced with multiple audiences in social network sites. Comput. Hum. Behav. 61, 56–62 (2016)
Marwick, A.E., Boyd, D.: I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media Soc. 13(1), 114–133 (2011)
McCandless, M.: Accuracy and performance of Google’s compact language detector. Blog post (2010)
McCrae, R.R., Costa, P.T.: Reinterpreting the Myers-Briggs type indicator from the perspective of the five-factor model of personality. J. Pers. 57(1), 17–40 (1989)
McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
Milroy, J.: Linguistic variation and change: on the historical sociolinguistics of English. B. Blackwell (1992)
Minkus, T., Liu, K., Ross, K.W.: Children seen but not heard: when parents compromise children’s online privacy. In: Proceedings of WWW, pp. 776–786. ACM (2015)
Mohammad, S.M., Turney, P.D.: Crowdsourcing a word-emotion association lexicon. Comput. Intell. 29(3), 436–465 (2013)
Monroe, B.L., Colaresi, M.P., Quinn, K.M.: Fightin’ words: lexical feature selection and evaluation for identifying the content of political conflict. Polit. Anal. 16(4), 372–403 (2008)
Nakagawa, S., Schielzeth, H.: A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods Ecol. Evol. 4(2), 133–142 (2013)
Nguyen, D., Smith, N.A., Rosé, C.P.: Author age prediction from text using linear regression. In: Proceedings of the Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 115–123. Association for Computational Linguistics (2011)
Nguyen, D.P., Gravel, R., Trieschnigg, R., Meder, T.: “How old do you think I am?”: a study of language and age in Twitter. In: Proceedings of ICWSM (2013)
Nguyen, D.P., Trieschnigg, R., Doğruöz, A.S., Gravel, R., Theune, M., Meder, T., de Jong, F.: Why gender and age prediction from tweets is hard: lessons from a crowdsourcing experiment. In: Proceedings of COLING (2014)
Nguyen, M.T., Lim, E.P.: On predicting religion labels in microblogging networks. In: Proceedings of SIGIR, pp. 1211–1214. ACM (2014)
Niederhoffer, K.G., Pennebaker, J.W.: Linguistic style matching in social interaction. J. Lang. Soc. Psychol. 21(4), 337–360 (2002)
Oomen, I., Leenes, R.: Privacy risk perceptions and privacy protection strategies. In: de Leeuw, E., Fischer-Hübner, S., Tseng, J., Borking, J. (eds.) IDMAN 2007. TIFIP, vol. 261, pp. 121–138. Springer, Boston, MA (2008). doi:10.1007/978-0-387-77996-6_10
Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting age and gender in online social networks. In: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pp. 37–44. ACM (2011)
Pennacchiotti, M., Popescu, A.M.: A machine learning approach to Twitter user classification. In: Proceedings of ICWSM, pp. 281–288 (2011)
Pennebaker, J.W., Stone, L.D.: Words of wisdom: language use over the life span. J. Pers. Soc. Psychol. 85(2), 291 (2003)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of EMNLP, vol. 14, pp. 1532–1543 (2014)
Phelan, C., Lampe, C., Resnick, P.: It’s creepy, but it doesn’t bother me. In: Proceedings of CHI, pp. 5240–5251. ACM (2016)
Plank, B., Hovy, D.: Personality traits on Twitter -or- how to get 1,500 personality tests in a week. In: Proceedings of WASSA (2015)
Postmes, T., Spears, R., Lea, M.: Breaching or building social boundaries? SIDE-effects of computer-mediated communication. Commun. Res. 25(6), 689–715 (1998)
Potthast, M., Hagen, M., Stein, B.: Author obfuscation: attacking the state of the art in authorship verification. In: Proceedings of CLEF (Working Notes), pp. 716–749 (2016)
Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J.: Our Twitter profiles, our selves: predicting personality with Twitter. In: Proceedings of SocialCom, pp. 180–185. IEEE (2011)
Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter. In: Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, pp. 37–44. ACM (2010)
Reddy, S., Knight, K.: Obfuscating gender in social media writing. In: Proceedings of Workshop on Natural Language Processing and Computational Social Science, pp. 17–26 (2016)
Reed, P.J., Spiro, E.S., Butts, C.T.: Thumbs up for privacy?: differences in online self-disclosure behavior across national cultures. Soc. Sci. Res. 59, 155–170 (2016)
Rosenthal, S., McKeown, K.: Age prediction in blogs: a study of style, content, and online behavior in pre-and post-social media generations. In: Proceedings of ACL, pp. 763–772. Association for Computational Linguistics (2011)
Rossi, L., Magnani, M.: Conversation practices and network structure in Twitter. In: Proceedings of ICWSM (2012)
Ryan, E.B., Hummert, M.L., Boich, L.H.: Communication predicaments of aging: patronizing behavior toward older adults. J. Lang. Soc. Psychol. 14(1–2), 144–166 (1995)
Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., Schwartz, H.A.: Developing age and gender predictive lexica over social media. In: Proceedings of EMNLP, pp. 1146–1151. Association for Computational Linguistics (2014)
Schnoebelen, T.J.: Emotions are relational: positioning and the use of affective linguistic resources. Ph.D. thesis, Stanford University (2012)
Schrammel, J., Köffel, C., Tscheligi, M.: Personality traits, usage patterns and information disclosure in online communities. In: Proceedings of HCI, pp. 169–174. British Computer Society (2009)
Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.S., Ungar, L.H.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8(9), e73791 (2013)
Shelton, M., Lo, K., Nardi, B.: Online media forums as separate social lives: a qualitative study of disclosure within and beyond Reddit. In: Proceedings of iConference (2015)
Snefjella, B., Kuperman, V.: Concreteness and psychological distance in natural language use. Psychol. Sci. 26(9), 1449–1460 (2015)
Soderberg, C., Callahan, S., Kochersberger, A., Amit, E., Ledgerwood, A.: The effects of psychological distance on abstraction: two meta-analyses. Psychol. Bull. 141(3), 525–548 (2015)
Spears, R., Lea, M.: Social influence and the influence of the “social” in computer-mediated communication. In: Lea, M. (ed.) Contexts of Computer-Mediated Communication, pp. 30–65. Harvester Wheatsheaf (1992)
Steinpreis, R.E., Anders, K.A., Ritzke, D.: The impact of gender on the review of the curricula vitae of job applicants and tenure candidates: a national empirical study. Sex Roles 41(7), 509–528 (1999)
Strater, K., Lipford, H.R.: Strategies and struggles with privacy in an online social networking community. In: Proceedings of the 22nd British HCI Group Annual Conference on People and Computers: Culture, Creativity, Interaction, vol. 1, pp. 111–119. British Computer Society (2008)
Stutzman, F., Vitak, J., Ellison, N.B., Gray, R., Lampe, C.: Privacy in interaction: exploring disclosure and social capital in Facebook. In: Proceedings of ICWSM (2012)
Tannen, D.: You Just Don’t Understand: Women and Men in Conversation. Virago, London (1991)
Tannen, D.: Gender and Conversational Interaction. Oxford University Press, Oxford (1993)
Tausczik, Y.R., Pennebaker, J.W.: The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29(1), 24–54 (2010)
Tchokni, S.E., Séaghdha, D.O., Quercia, D.: Emoticons and phrases: status symbols in social media. In: Proceedings of ICWSM (2014)
Thomas, K., Grier, C., Nicol, D.M.: unFriendly: multi-party privacy risks in social networks. In: Atallah, M.J., Hopper, N.J. (eds.) PETS 2010. LNCS, vol. 6205, pp. 236–252. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14527-8_14
Trepte, S., Reinecke, L., Ellison, N.B., Quiring, O., Yao, M.Z., Ziegele, M.: A cross-cultural perspective on the privacy calculus. Soc. Media+ Soc. 3(1), 2056305116688035 (2017)
Trope, Y., Liberman, N.: Construal-level theory of psychological distance. Psychol. Rev. 117(2), 440 (2010)
Volkova, S., Bachrach, Y., Armstrong, M., Sharma, V.: Inferring latent user properties from texts published in social media. In: Proceedings of AAAI, pp. 4296–4297 (2015)
Wienberg, C., Gordon, A.S.: Privacy considerations for public storytelling. In: Proceedings of ICWSM (2014)
Yaeger-Dror, M.: Religion as a sociolinguistic variable. Language and Linguistics Compass 8(11), 577–589 (2014)
Youn, S., Hall, K.: Gender and online privacy among teens: risk perception, privacy concerns, and protection behaviors. Cyberpsychol. Behav. 11(6), 763–765 (2008)
Zhang, K., Kizilcec, R.F.: Anonymity in social media: effects of content controversiality and social endorsement on sharing behavior. In: Proceedings of ICWSM (2014)
Acknowledgments
We thank the anonymous reviewers, SocInfo organizers, the Stanford Data Science Initiative, and Twitter and Gnip for providing access to part of data used in this study. This work was supported by the National Science Foundation through awards IIS-1159679 and IIS-1526745.
Appendices
Appendix
A Classification Metrics
Macro-averaged F1 denotes the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many instances carry that label. Micro-averaged F1 denotes the F1 computed over all instances pooled together and is therefore sensitive to skew in the distribution of classes in the dataset.
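To make the distinction concrete, the following is a minimal sketch of both averages for single-label classification. The function name `macro_micro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's experiments.

```python
from collections import Counter


def macro_micro_f1(gold, pred):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, one vote per class.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical skewed data: 8 instances of "a", 2 of "b", and a
# classifier that always predicts the majority class "a".
gold = ["a"] * 8 + ["b"] * 2
pred = ["a"] * 10
macro, micro = macro_micro_f1(gold, pred)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

On this toy input the majority-class predictor scores well on the micro average but poorly on the macro average, because the missed minority class contributes an F1 of zero with equal weight.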
B Additional Classifier Results
C Additional Measures of Tie Strength
We initially considered two other potential proxies for tie strength based on textual analysis. First, we replicated the approach of Gilbert and Karahalios [44], which counts words occurring in ten LIWC categories to approximate intimacy in communication. Second, we attempted to measure social distance [58] by drawing upon Construal-Level Theory [59, 110], which conjectures that individuals with low social distance typically use more concrete language, whereas those with high social distance use more abstract language [98, 99]; here, communication concreteness was measured using the word concreteness ratings of [16]. However, the ratings produced by each approach did not match our judgments of the attribute it was intended to capture, and using them in the regression models produced non-significant results. Lacking ground truth for intimacy and social distance against which to validate the ratings, we judged these proxies unreliable and omitted them to avoid drawing false conclusions about these dimensions of tie strength.
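As an illustration of the second proxy, a lexicon-based concreteness score for a message could be sketched as below. The aggregation (mean rating over the tokens found in the lexicon), the tokenizer, the function name `mean_concreteness`, and the toy lexicon values are all assumptions made for this example; the text above does not specify how the ratings of [16] were aggregated, and the numbers here are not taken from the published norms.

```python
import re

# Hypothetical toy lexicon on a 1 (abstract) to 5 (concrete) scale.
# Values are invented for illustration only.
TOY_CONCRETENESS = {
    "coffee": 5.0,
    "table": 5.0,
    "idea": 1.5,
    "hope": 1.0,
}


def mean_concreteness(message, lexicon):
    """Average the ratings of the tokens that appear in the lexicon.

    Returns None when no token in the message has a rating, so that
    unrated messages are not confused with maximally abstract ones.
    """
    tokens = re.findall(r"[a-z']+", message.lower())
    rated = [lexicon[t] for t in tokens if t in lexicon]
    return sum(rated) / len(rated) if rated else None


print(mean_concreteness("Coffee on the table", TOY_CONCRETENESS))
print(mean_concreteness("An idea of hope", TOY_CONCRETENESS))
```

Under the conjecture described above, messages between socially close users would be expected to score higher on such a measure than messages between distant users, although the proxy was ultimately omitted as unreliable.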
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Jurgens, D., Tsvetkov, Y., Jurafsky, D. (2017). Writer Profiling Without the Writer’s Text. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science, vol 10540. Springer, Cham. https://doi.org/10.1007/978-3-319-67256-4_43
DOI: https://doi.org/10.1007/978-3-319-67256-4_43
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67255-7
Online ISBN: 978-3-319-67256-4
eBook Packages: Computer Science, Computer Science (R0)