Simply the Best: Minimalist System Trumps Complex Models in Author Profiling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11018))

Abstract

A simple linear SVM with word and character n-gram features and minimal parameter tuning can identify the gender and the language variety (for English, Spanish, Arabic and Portuguese) of Twitter users with very high accuracy. All our attempts at improving performance by adding more data, using smarter features, or employing more complex architectures plainly fail. We also experiment with joint and multitask modelling, but find that both are clearly outperformed by single-task models. In the end, our simplest model was submitted to the PAN 2017 shared task on author profiling, where it obtained an average accuracy of 0.86 on the test set, with performance on sub-tasks ranging from 0.68 to 0.98. These were the best results achieved at the competition overall. To allow lay people to easily use and see the value of machine learning for author profiling, we also built a web application on top of our models.
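As a rough illustration of the word and character n-gram features mentioned in the abstract, the following minimal sketch extracts both kinds of n-grams from a piece of text using only the Python standard library. The n-gram ranges (word 1 to 2, character 3 to 5), the lowercasing, the whitespace tokenisation, and the function names are all assumptions made for this example; the abstract does not specify them. The resulting sparse counts are the kind of representation that could be passed to a linear SVM.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Yield word n-grams (joined by spaces) for n in [n_min, n_max].

    The range 1-2 is an assumption for illustration only.
    """
    tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def char_ngrams(text, n_min=3, n_max=5):
    """Yield character n-grams for n in [n_min, n_max].

    The range 3-5 is an assumption for illustration only.
    """
    text = text.lower()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def extract_features(text):
    """Return a sparse bag of n-gram counts.

    Features are prefixed with "w:" or "c:" so that word and character
    n-grams with the same surface form stay distinct.
    """
    features = Counter()
    features.update("w:" + gram for gram in word_ngrams(text))
    features.update("c:" + gram for gram in char_ngrams(text))
    return features


# Hypothetical input text, not taken from the PAN 2017 data.
feats = extract_features("so happy today")
print(feats["w:so happy"], feats["c:hap"])
```

In practice, such counts would typically be weighted (for example with tf-idf) and vectorised before training a classifier; that step is omitted here to keep the sketch self-contained.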


Notes

  1. This is the training set released at PAN 2017. An additional test set was available for testing models during the campaign, but was no longer available at the time of writing.

  2. https://spacy.io/.

  3. https://blog.swiftkey.com/americans-love-skulls-brazilians-love-cats-swiftkey-emoji-meanings-report/.

  4. http://www.unicode.org/emoji/charts/full-emoji-list.html.

  5. https://aabeta.herokuapp.com.

  6. https://news.microsoft.com/features/democratizing-ai/.

Acknowledgements

We are grateful to the organisers of PAN 2017 for making the data available. We also would like to thank Barbara Plank for her advice on the MTL architecture and the anonymous reviewers for providing valuable insights.

Author information

Corresponding author

Correspondence to Malvina Nissim.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M. (2018). Simply the Best: Minimalist System Trumps Complex Models in Author Profiling. In: Bellot, P., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2018. Lecture Notes in Computer Science, vol 11018. Springer, Cham. https://doi.org/10.1007/978-3-319-98932-7_14

  • DOI: https://doi.org/10.1007/978-3-319-98932-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98931-0

  • Online ISBN: 978-3-319-98932-7

  • eBook Packages: Computer Science, Computer Science (R0)
