DOI: 10.1145/3555776.3577856

Impact of Character n-grams Attention Scores for English and Russian News Articles Authorship Attribution

Published: 07 June 2023

Abstract

Language embeddings are often used as black-box, word-level tools that provide powerful language analysis across many tasks. Yet for tasks such as authorship attribution, access to feature-level information about character n-grams can provide insights that help with model refinement and development. In this paper we investigate and evaluate the importance of character n-grams within an embeddings context for authorship attribution through the use of attention scores. We perform this investigation on both an English (Reuters_50_50) and a Russian (Taiga) news authorship dataset. Our analysis shows that attention scores are higher for the character n-grams that humans consider important for authorship identification. Beyond its specific benefits for authorship attribution, this work offers insight into the importance of the character n-gram as a unit within embeddings.
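
To make the setup concrete, here is a minimal sketch, in PyTorch, of the general technique the abstract describes: character n-grams are embedded, an attention layer assigns each n-gram a normalized weight, and those weights can be read off as per-n-gram importance scores. Everything below (the NgramAttentionClassifier module, the additive scoring layer, the toy vocabulary) is an illustrative assumption, not the authors' implementation or the models evaluated in the paper.

```python
import torch
import torch.nn as nn

def char_ngrams(text: str, n: int = 3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramAttentionClassifier(nn.Module):
    """Embed character n-grams and pool them with additive attention;
    the attention weights double as per-n-gram importance scores.
    (Hypothetical module for illustration, not the paper's model.)"""
    def __init__(self, vocab_size: int, embed_dim: int = 64, num_authors: int = 10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(embed_dim, 1)             # one scalar per n-gram
        self.classify = nn.Linear(embed_dim, num_authors)

    def forward(self, ngram_ids: torch.Tensor):
        e = self.embed(ngram_ids)                        # (seq_len, embed_dim)
        w = torch.softmax(self.score(e).squeeze(-1), 0)  # (seq_len,) attention weights
        doc = (w.unsqueeze(-1) * e).sum(0)               # attention-weighted document vector
        return self.classify(doc), w

# Toy usage: which trigrams of a sentence does the (untrained) layer weight most?
text = "the quick brown fox jumps over the lazy dog"
grams = char_ngrams(text)
vocab = {g: i for i, g in enumerate(sorted(set(grams)))}
ids = torch.tensor([vocab[g] for g in grams])
model = NgramAttentionClassifier(vocab_size=len(vocab))
logits, weights = model(ids)
for g, w in sorted(zip(grams, weights.tolist()), key=lambda p: p[1], reverse=True)[:5]:
    print(f"{g!r}: {w:.3f}")
```

In a real experiment the weights would be inspected after training on an authorship corpus such as Reuters_50_50 or Taiga; the untrained weights printed here are essentially arbitrary and serve only to show where such scores come from.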


    Published In

    SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
    March 2023
    1932 pages
    ISBN: 9781450395175
    DOI: 10.1145/3555776
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. character n-grams
    2. authorship attribution task
    3. attention score

    Qualifiers

    • Poster

    Conference

    SAC '23

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
