DOI: 10.1145/3555776.3577856

Impact of Character n-grams Attention Scores for English and Russian News Articles Authorship Attribution

Published: 07 June 2023

Abstract

Language embeddings are often used as black-box, word-level tools that provide powerful language analysis across many tasks. Yet for tasks such as authorship attribution, access to feature-level information about character n-grams can provide insights that help with model refinement and development. In this paper we investigate and evaluate the importance of character n-grams within an embeddings context for authorship attribution through the use of attention scores. We perform this investigation on both an English (Reuters_50_50) and a Russian (Taiga) news authorship dataset. Our analysis shows that attention scores are higher for the character n-grams that humans consider important for authorship identification. Beyond its specific benefits for authorship attribution, this work offers insight into the importance of the character n-gram as a unit within embeddings.
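
To make the setup concrete, here is a minimal sketch, in PyTorch, of the general technique the abstract describes: character n-grams are embedded, an attention layer assigns each n-gram a normalized weight, and those weights can be read off as per-n-gram importance scores. Everything below (the NgramAttentionClassifier module, the additive scoring layer, the toy vocabulary) is an illustrative assumption, not the authors' implementation or the models evaluated in the paper.

```python
import torch
import torch.nn as nn

def char_ngrams(text: str, n: int = 3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramAttentionClassifier(nn.Module):
    """Embed character n-grams and pool them with additive attention;
    the attention weights double as per-n-gram importance scores.
    (Hypothetical module for illustration, not the paper's model.)"""
    def __init__(self, vocab_size: int, embed_dim: int = 64, num_authors: int = 10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(embed_dim, 1)             # one scalar per n-gram
        self.classify = nn.Linear(embed_dim, num_authors)

    def forward(self, ngram_ids: torch.Tensor):
        e = self.embed(ngram_ids)                        # (seq_len, embed_dim)
        w = torch.softmax(self.score(e).squeeze(-1), 0)  # (seq_len,) attention weights
        doc = (w.unsqueeze(-1) * e).sum(0)               # attention-weighted document vector
        return self.classify(doc), w

# Toy usage: which trigrams of a sentence does the (untrained) layer weight most?
text = "the quick brown fox jumps over the lazy dog"
grams = char_ngrams(text)
vocab = {g: i for i, g in enumerate(sorted(set(grams)))}
ids = torch.tensor([vocab[g] for g in grams])
model = NgramAttentionClassifier(vocab_size=len(vocab))
logits, weights = model(ids)
for g, w in sorted(zip(grams, weights.tolist()), key=lambda p: p[1], reverse=True)[:5]:
    print(f"{g!r}: {w:.3f}")
```

In a real experiment the weights would be inspected after training on an authorship corpus such as Reuters_50_50 or Taiga; the untrained weights printed here are essentially arbitrary and serve only to show where such scores come from.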


    Published In

    SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
    March 2023
    1932 pages
    ISBN: 9781450395175
    DOI: 10.1145/3555776
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. character n-grams
    2. authorship attribution task
    3. attention score

    Qualifiers

    • Poster

    Conference

    SAC '23

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
