Simply the Best: Minimalist System Trumps Complex Models in Author Profiling

Basile, Angelo; Dwyer, Gareth; Medvedeva, Maria; Rawee, Josine; Haagsma, Hessel; Nissim, Malvina

doi:10.1007/978-3-319-98932-7_14

Conference paper
First Online: 15 August 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11018)

Abstract

A simple linear SVM with word and character n-gram features and minimal parameter tuning can identify the gender and the language variety (for English, Spanish, Arabic and Portuguese) of Twitter users with very high accuracy. All our attempts at improving performance by including more data, adding smarter features, or employing more complex architectures plainly fail. In addition, we experiment with joint and multitask modelling, but find that both are clearly outperformed by single-task models. In the end, we submitted our simplest model to the PAN 2017 shared task on author profiling, where it obtained an average accuracy of 0.86 on the test set, with performance on sub-tasks ranging from 0.68 to 0.98. These were the best results achieved at the competition overall. To allow lay people to easily use and see the value of machine learning for author profiling, we also built a web application on top of our models.
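
To make the feature side of the abstract concrete, the following is a minimal sketch of word and character n-gram extraction in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the n-gram ranges (word 1 to 2, character 3 to 5), the whitespace tokenisation, the `w:`/`c:` key prefixes, and the function names are all hypothetical choices, since the abstract does not specify them. The resulting sparse counts are the kind of input a linear SVM would then be trained on.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams for n in [n_min, n_max] over whitespace tokens.

    The range 1-2 is an assumption made for illustration only.
    """
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams for n in [n_min, n_max], spaces included.

    The range 3-5 is an assumption made for illustration only.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def extract_features(text):
    """Merge both feature families into one sparse bag of n-grams."""
    features = word_ngrams(text)
    features.update(char_ngrams(text))  # Counter.update adds counts
    return features


# Hypothetical example text standing in for a user's tweets.
features = extract_features("so happy today")
```

Prefixing the keys keeps a word unigram such as `so` distinct from any character n-gram with the same surface string, so both feature families can share a single vector space.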

Notes

1. This is the training set released at PAN 2017. An additional test set was available for testing models during the campaign, but no longer at the time of writing.
2. https://spacy.io/
3. https://blog.swiftkey.com/americans-love-skulls-brazilians-love-cats-swiftkey-emoji-meanings-report/
4. http://www.unicode.org/emoji/charts/full-emoji-list.html
5. https://aabeta.herokuapp.com
6. https://news.microsoft.com/features/democratizing-ai/

Acknowledgements

We are grateful to the organisers of PAN 2017 for making the data available. We also would like to thank Barbara Plank for her advice on the MTL architecture and the anonymous reviewers for providing valuable insights.

Author information

Authors and Affiliations

Faculty of ICT, University of Malta, Msida, Malta
Angelo Basile
Center for Mind/Brain Sciences, University of Trento, Rovereto, Italy
Josine Rawee
Center for Language and Cognition, University of Groningen, Groningen, The Netherlands
Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma & Malvina Nissim

Corresponding author

Correspondence to Malvina Nissim.

Editor information

Editors and Affiliations

Aix-Marseille University, Marseille Cedex 20, France
Patrice Bellot
Virtual University of Tunis, Tunis, Tunisia
Chiraz Trabelsi
Systèmes d’informations, Big Data et Rec, Institut de Recherche en Informatique de, Toulouse Cedex 04, France
Josiane Mothe
Department of Computer Science, University of Huddersfield, Huddersfield, United Kingdom
Fionn Murtagh
DIRO, Universite de Montreal, Montreal, Québec, Canada
Jian Yun Nie
Pierre and Marie Curie University, Paris Cedex 05, France
Laure Soulier
Université d'Avignon et des Pays de, Avignon, France
Eric SanJuan
Department of Information Engineering, University of Padua, Padua, Padova, Italy
Linda Cappellato
University of Padua, Padua, Italy
Nicola Ferro

About this paper

Cite this paper

Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M. (2018). Simply the Best: Minimalist System Trumps Complex Models in Author Profiling. In: Bellot, P., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2018. Lecture Notes in Computer Science, vol 11018. Springer, Cham. https://doi.org/10.1007/978-3-319-98932-7_14

DOI: https://doi.org/10.1007/978-3-319-98932-7_14
Published: 15 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98931-0
Online ISBN: 978-3-319-98932-7
eBook Packages: Computer Science, Computer Science (R0)