Abstract
Conventional approaches to gender classification rely heavily on large amounts of labeled data, which are normally hard and expensive to obtain. In this paper, we propose a co-training approach to address this problem. Specifically, we use both non-interactive and interactive texts, i.e., message and comment texts, as two different views in our co-training approach in order to incorporate unlabeled data effectively. Experimental results on a large micro-blog data set demonstrate the appropriateness of leveraging interactive knowledge in gender classification and the effectiveness of the proposed co-training approach.
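The two-view idea in the abstract can be illustrated with a minimal co-training sketch in the general style of Blum and Mitchell. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the base classifier, the confidence measure, or how many examples are added per round, so the small naive Bayes model, the `per_round` parameter, and the names `NaiveBayes`, `co_train`, and `predict` are all hypothetical choices made here.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace tokens (assumed stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for w in text.split():
                if w in self.vocab:  # words never seen in training are ignored
                    # Laplace (add-one) smoothing
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (total_words + len(self.vocab))
                    )
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Co-training over two views.

    labeled:   list of ((message_text, comment_text), label)
    unlabeled: list of (message_text, comment_text)
    Each round, one classifier per view labels its most confident
    unlabeled examples, which then join the shared labeled set.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        labels = [y for _, y in labeled]
        newly_labeled = {}
        for view in (0, 1):  # 0 = message view, 1 = comment view
            clf = NaiveBayes().fit([x[view] for x, _ in labeled], labels)
            scored = []
            for i, x in enumerate(pool):
                proba = clf.predict_proba(x[view])
                best = max(proba, key=proba.get)
                scored.append((proba[best], i, best))
            scored.sort(reverse=True)
            for _, i, best in scored[:per_round]:
                newly_labeled.setdefault(i, best)  # first view to claim an example wins
        labeled.extend((pool[i], y) for i, y in newly_labeled.items())
        pool = [x for i, x in enumerate(pool) if i not in newly_labeled]
    labels = [y for _, y in labeled]
    msg_clf = NaiveBayes().fit([x[0] for x, _ in labeled], labels)
    cmt_clf = NaiveBayes().fit([x[1] for x, _ in labeled], labels)
    return msg_clf, cmt_clf, labeled


def predict(msg_clf, cmt_clf, example):
    """Combine both views by multiplying their class probabilities."""
    p_msg = msg_clf.predict_proba(example[0])
    p_cmt = cmt_clf.predict_proba(example[1])
    return max(p_msg, key=lambda c: p_msg[c] * p_cmt[c])
```

Each example is a pair of texts, one per view, so a user with only a handful of labeled peers can still contribute through whichever view the corresponding classifier is confident about. Multiplying the two views' probabilities at prediction time is one common way to combine them; the paper may combine them differently.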
Acknowledgments
This research work has been partially supported by three NSFC grants (No. 61273320, No. 61375073, and No. 61331011) and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, J., Xue, Y., Li, S., Zhou, G. (2015). Leveraging Interactive Knowledge and Unlabeled Data in Gender Classification with Co-training. In: Liu, A., Ishikawa, Y., Qian, T., Nutanong, S., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9052. Springer, Cham. https://doi.org/10.1007/978-3-319-22324-7_23
DOI: https://doi.org/10.1007/978-3-319-22324-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22323-0
Online ISBN: 978-3-319-22324-7
eBook Packages: Computer Science, Computer Science (R0)