extended-abstract

Revealing the Demographic Attributes of the Authors from the Abstracts of Scientific Articles

Author:
Salim Sazzed

Old Dominion University, USA


HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media, June 2022, Pages 209–213. https://doi.org/10.1145/3511095.3536358

Published: 28 June 2022


ABSTRACT

This study presents multiple strategies for automatically revealing undisclosed demographic attributes of authors in double-blind submissions. From a limited amount of text, around 100-200 words excerpted from an abstract, it aims to reveal three pieces of information about the primary author: i) English language nativeness, ii) country of origin, and iii) gender. We introduce an annotated dataset of over 5600 articles labeled with the native language, country of origin, and gender of the primary authors. We employ classical machine learning (CML) algorithms with statistical n-gram features and fine-tuned transformer-based language models to determine these attributes. The transformer-based models perform slightly better on all three tasks. They achieve macro F1 scores close to 75% for identifying the English language nativeness of the primary authors. For determining the country of non-native English authors (a 10-class classification task), the fine-tuned transformer-based models obtain F1 scores of around 60%. For gender prediction, the transformer-based models attain F1 scores of 0.65 (65%). The experimental results demonstrate that fine-tuned language models and CML classifiers can disclose various author attributes accurately enough to undermine the blindness of double-blind submission.
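As an illustration of two terms used in the abstract, the following minimal Python sketch shows how word n-gram counts might be extracted from a short text and how a macro F1 score is computed from predicted labels. It is not the author's implementation: the function names `word_ngrams` and `macro_f1`, the whitespace tokenization, the choice of unigrams plus bigrams, and the toy labels are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n_max=2):
    """Count word n-grams of length 1..n_max (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, an assumption
    made for this sketch only.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
features = word_ngrams("we propose a novel method we propose")
gold = ["native", "native", "native", "non-native"]
pred = ["native", "native", "non-native", "non-native"]
score = macro_f1(gold, pred)
print(features.most_common(3))
print(round(score, 3))
```

Because macro F1 averages the per-class scores without weighting by class size, a minority class (for example, a rarely occurring country label) counts as much as a majority class.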

Supplemental Material

demography_scientific.mp4 (mp4, 20.2 MB)




Published in
HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media
June 2022
272 pages
ISBN:9781450392334
DOI:10.1145/3511095
General Chairs: Alejandro Bellogín (Universidad Autonoma de Madrid, Spain), Ludovico Boratto (University of Cagliari, Italy)
Program Chair: Federica Cena (University of Torino, Italy)
Copyright © 2022 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
article abstract
authorship attribution
demography prediction
gender prediction
native language identification
peer review
scientific publication
Qualifiers
- extended-abstract
- Research
- Refereed limited

Acceptance Rates
Overall Acceptance Rate: 378 of 1,158 submissions, 33%