DOI: 10.1145/3511095.3536358
extended-abstract

Revealing the Demographic Attributes of the Authors from the Abstracts of Scientific Articles

Published: 28 June 2022

ABSTRACT

This study presents multiple strategies for automatically revealing undisclosed demographic attributes of authors in double-blind submissions. From a limited amount of text, roughly 100-200 words excerpted from an abstract, the study aims to reveal three attributes of the primary author: i) English language nativeness, ii) country of origin, and iii) gender. We introduce an annotated dataset of over 5600 articles labeled with the native language, country of origin, and gender of the primary authors. We employ classical machine learning (CML) algorithms with statistical n-gram features and fine-tuned transformer-based language models to determine these attributes. We observe that the transformer-based models perform slightly better on all three tasks. They achieve macro F1 scores close to 75% for identifying the English language nativeness of the primary authors. For determining the country of non-native English authors (10-class classification), the fine-tuned transformer-based models obtain F1 scores of around 60%. For the gender prediction task, the transformer-based models attain F1 scores of 65%. The experimental results demonstrate that both the fine-tuned language models and the CML classifiers can disclose various author attributes accurately enough to undermine the blindness of double-blind submissions.
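As a rough illustration of two ingredients named in the abstract, the sketch below counts word n-grams in a short text and computes a macro-averaged F1 score. It is not the authors' pipeline: the abstract does not say whether the n-grams are word-level or character-level, so word unigrams and bigrams are an assumption here, and the function names `word_ngrams` and `macro_f1` as well as the toy labels are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams of length n_min..n_max in a lowercased text.

    Assumption: word-level n-grams; the abstract only says "statistical
    n-gram features" without giving the unit or the range.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration only (not from the paper's dataset).
y_true = ["native", "native", "non-native", "non-native"]
y_pred = ["native", "non-native", "non-native", "non-native"]

features = word_ngrams("We propose a model")
score = macro_f1(y_true, y_pred)
```

Macro averaging gives every class the same weight regardless of how many examples it has, which is presumably why it is a natural choice for tasks such as the 10-class country prediction, where class sizes may differ.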

Supplemental Material

demography_scientific.mp4 (mp4, 20.2 MB)


  • Published in

    HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media
    June 2022
    272 pages
    ISBN: 9781450392334
    DOI: 10.1145/3511095

    Copyright © 2022 Owner/Author

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 28 June 2022


    Qualifiers

    • extended-abstract
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 378 of 1,158 submissions, 33%
