Abstract
Demographic attributes of Internet users, such as gender and age, provide important information for marketing, personalization, and user behavior research. This paper addresses the problem of predicting users’ gender from their browsing history. We take a classification-based approach and investigate a number of features derived from browsing log data. We show that high-level content features such as topics or categories are highly predictive of gender, and that combining them with features derived from access times and browsing patterns leads to significant improvements in prediction accuracy. We empirically verify the effectiveness of the method on real datasets from Vietnamese online media. The method substantially outperforms a baseline and achieves a macro-averaged F1 score of 0.805. Experimental results also demonstrate the effectiveness of combining different feature types: a combination of features achieves a 12% improvement in F1 score over the best-performing individual feature type.
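For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro-averaged F1 score is computed for a two-class problem such as gender prediction. The function names and the toy labels are illustrative assumptions and are not taken from the paper; the paper's own evaluation code and data are not shown here.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy example with hypothetical labels (not data from the paper).
y_true = ["male", "male", "female", "female"]
y_pred = ["male", "female", "female", "female"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, macro-averaging does not let a majority class dominate the score, which matters when the gender distribution in the logs is unbalanced.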
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Phuong, D.V., Phuong, T.M. (2014). Gender Prediction Using Browsing History. In: Huynh, V., Denoeux, T., Tran, D., Le, A., Pham, S. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 244. Springer, Cham. https://doi.org/10.1007/978-3-319-02741-8_24
DOI: https://doi.org/10.1007/978-3-319-02741-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02740-1
Online ISBN: 978-3-319-02741-8
eBook Packages: Engineering, Engineering (R0)