Abstract
A simple linear SVM with word and character n-gram features and minimal parameter tuning can identify the gender and the language variety (for English, Spanish, Arabic and Portuguese) of Twitter users with very high accuracy. All our attempts at improving performance by including more data, using smarter features, and employing more complex architectures plainly fail. In addition, we experiment with joint and multitask modelling, but find that both are clearly outperformed by single-task models. In the end, our simplest model was submitted to the PAN 2017 shared task on author profiling, obtaining an average accuracy of 0.86 on the test set, with performance on sub-tasks ranging from 0.68 to 0.98. These were the best results achieved at the competition overall. To allow lay people to easily use and see the value of machine learning for author profiling, we also built a web application on top of our models.
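To make the feature set named in the abstract concrete, the following is a minimal sketch of extracting word and character n-gram counts from a piece of text. It is not the authors' implementation: the function names, the whitespace tokenisation, the n-gram ranges (word 1-2, character 3-5) and the `w:`/`c:` prefixes are all assumptions chosen for illustration.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams for n in [n_min, n_max] (ranges are assumed)."""
    tokens = text.split()  # plain whitespace tokenisation, an assumption
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text, n_min=3, n_max=5):
    """Return character n-grams for n in [n_min, n_max] (ranges are assumed)."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def extract_features(text):
    """Combine both n-gram types into one sparse bag of counts."""
    features = Counter()
    for gram in word_ngrams(text):
        features["w:" + gram] += 1  # prefix keeps word features distinct
    for gram in char_ngrams(text):
        features["c:" + gram] += 1  # prefix keeps character features distinct
    return features


example = extract_features("so happy today")
```

Count dictionaries of this kind could then be vectorised and passed to any linear SVM implementation, for instance scikit-learn's `DictVectorizer` followed by `LinearSVC`; that pairing is one possible choice, not a detail taken from the abstract.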
Notes
- 1. This is the training set released at PAN 2017. An additional test set was available for testing models during the campaign, but it was no longer available at the time of writing.
Acknowledgements
We are grateful to the organisers of PAN 2017 for making the data available. We would also like to thank Barbara Plank for her advice on the MTL architecture and the anonymous reviewers for providing valuable insights.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M. (2018). Simply the Best: Minimalist System Trumps Complex Models in Author Profiling. In: Bellot, P., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2018. Lecture Notes in Computer Science, vol 11018. Springer, Cham. https://doi.org/10.1007/978-3-319-98932-7_14
DOI: https://doi.org/10.1007/978-3-319-98932-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98931-0
Online ISBN: 978-3-319-98932-7
eBook Packages: Computer Science, Computer Science (R0)